
Optimize single node selection in Shrink Action of ILM #76206

Open
gaobinlong wants to merge 26 commits into main

Conversation

gaobinlong
Contributor

Relates to #67957.

The main changes of this PR are:

  1. Optimize node selection in SetSingleNodeAllocateStep of ILM's Shrink Action. The selection works as follows (a rough sketch of this logic is shown after this list):
    (1) Get node stats, including only fs info and index store stats.
    (2) Calculate each node's shard storage bytes for the source index.
    (3) Sum all nodes' shard storage bytes to get the storage bytes of the source index's primary shards.
    (4) Select the nodes that can hold two copies of the index's primary shards. If the file system doesn't support hard-linking, all segments are copied into the new shrunken index, so the node's free bytes must stay above the low watermark to make sure the new shrunken index can be initialized successfully.
    (5) From the nodes selected in step (4), pick the node that holds the maximum shard storage bytes of the source index, because we want to reduce data transfer cost as much as possible.
    (6) If we cannot find a node that contains any shard of the source index, shuffle the valid node list and select a node randomly.

  2. Add some test methods for the changes above.
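
For illustration, steps (2)–(6) above correspond roughly to the following sketch; the node stats gathering of step (1) is assumed to have already produced the two input maps. All names here (selectNode, sourceIndexBytesPerNode, freeBytesPerNode, lowWatermarkBytes) are illustrative placeholders, not the actual SetSingleNodeAllocateStep code:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of the node-selection steps above, not the PR's actual implementation.
class ShrinkNodeSelectionSketch {

    static Optional<String> selectNode(Map<String, Long> sourceIndexBytesPerNode, // step (2): source-index shard bytes per node
                                       Map<String, Long> freeBytesPerNode,        // free disk bytes per node, from node stats
                                       long lowWatermarkBytes) {                  // low watermark expressed in bytes
        // step (3): total storage used by the source index's primary shards
        long primaryShardsBytes = sourceIndexBytesPerNode.values().stream().mapToLong(Long::longValue).sum();

        // step (4): keep nodes that can hold two copies of the primaries while staying above the low watermark
        List<String> validNodeIds = new ArrayList<>();
        for (Map.Entry<String, Long> node : freeBytesPerNode.entrySet()) {
            long bytesAlreadyOnNode = sourceIndexBytesPerNode.getOrDefault(node.getKey(), 0L);
            if (node.getValue() > lowWatermarkBytes + 2 * primaryShardsBytes - bytesAlreadyOnNode) {
                validNodeIds.add(node.getKey());
            }
        }
        if (validNodeIds.isEmpty()) {
            return Optional.empty();
        }

        // step (5): prefer the valid node that already holds the most data of the source index
        Optional<String> best = validNodeIds.stream()
            .filter(id -> sourceIndexBytesPerNode.getOrDefault(id, 0L) > 0)
            .max(Comparator.comparingLong((String id) -> sourceIndexBytesPerNode.get(id)));
        if (best.isPresent()) {
            return best;
        }

        // step (6): otherwise pick any valid node at random
        Collections.shuffle(validNodeIds);
        return Optional.of(validNodeIds.get(0));
    }
}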

@elasticsearchmachine elasticsearchmachine added v8.0.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Aug 6, 2021
@dakrone dakrone added :Data Management/ILM+SLM Index and Snapshot lifecycle management team-discuss labels Aug 12, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Aug 12, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@dakrone dakrone self-requested a review August 19, 2021 15:49
@dakrone
Member

dakrone commented Aug 19, 2021

Thanks for opening this @gaobinlong, I think it's a good improvement for selecting a node. I will take a look and leave some review comments.

@dakrone dakrone left a comment

Thanks @gaobinlong, I left a number of comments on this.

Comment on lines 112 to 113
Arrays.stream(indexShardStats.getShards()).mapToLong(shardStats ->
shardStats.getStats().getStore().getSizeInBytes()).sum()).sum();
Member

Rather than do this for all shards, we can do something similar to:

indexShardStats.getPrimary().getStore().getSizeInBytes()

And then avoid the division below by the number of replicas (because we really only care about the size of the primary shards anyway):

Suggested change
- Arrays.stream(indexShardStats.getShards()).mapToLong(shardStats ->
-     shardStats.getStats().getStore().getSizeInBytes()).sum()).sum();
+ indexShardStats.getPrimary().getStore().getSizeInBytes()).sum();

Member

Also, getStore() is @Nullable, so there should be protection added to ensure it doesn't throw an NPE when it's null
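
For illustration, a null-safe version of the summation might look roughly like this (a sketch only, reusing the accessor names quoted above rather than the PR's actual change):

// Sketch: skip shards whose store stats are null before summing, since getStore() is @Nullable.
long storageBytes = Arrays.stream(indexShardStats.getShards())
    .map(shardStats -> shardStats.getStats().getStore())
    .filter(Objects::nonNull)                  // java.util.Objects
    .mapToLong(store -> store.getSizeInBytes())
    .sum();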

Contributor Author

I found that when we add the index.routing.allocation.require._id setting to the index, either the primary shard or a replica shard can be relocated to the selected node, so we should account for the size of both primary and replica shards.

Comment on lines +120 to +122
if (indexMetadata.getNumberOfReplicas() != 0) {
indexPrimaryShardsStorageBytes /= indexMetadata.getNumberOfReplicas();
}
Member

This can be removed if we use the primary stats above

Contributor

The calculation above is used to indicate "how much of the source index's storage" is available per node. As shrink works from both primary and replica shards, I believe it's correct to include replicas in the math.

Comment on lines 129 to 134
if (diskThresholdSettings.getFreeDiskThresholdLow() != 0) {
freeBytesThresholdLow = (long) Math.ceil(nodeTotalBytes *
diskThresholdSettings.getFreeDiskThresholdLow() * 0.01);
} else {
freeBytesThresholdLow = diskThresholdSettings.getFreeBytesThresholdLow().getBytes();
}
Member

I'm not sure why you're doing this here: when constructing the DiskThresholdSettings the bytes are always calculated (see the setLowWatermark(...) call in the constructor), so I think you can always use getFreeBytesThresholdLow() to get the number of bytes rather than doing a calculation with the percentage?

Contributor Author

In my testing, diskThresholdSettings.getFreeBytesThresholdLow() returns 0 if we set the low watermark to a percentage, and the converse is also true.

Contributor
@andreidan andreidan Feb 17, 2022

Gosh, so this is very confusing. @gaobinlong you're right about getFreeBytesThresholdLow returning 0 when the watermark is configured using percentages. Like Lee, I've also been tripped up by the setLowWatermark method in DiskThresholdSettings.

For another PR - we should rename the methods used in setLowWatermark to reflect that they only maybe return something, i.e. thresholdPercentageFromWatermark -> thresholdPercentageFromWatermarkIfPercentageConfigured or maybeThresholdPercentageFromWatermark.
Similarly for thresholdBytesFromWatermark.

Contributor

Can we extract this math into a method in DiskThresholdSettings? i.e. getLowWatermarkAsBytes or something similarly named? (with corresponding unit tests)

Contributor Author
@gaobinlong gaobinlong Apr 4, 2022

@andreidan, thanks for the suggestion, I've added some public methods in DiskThresholdSettings and added some unit tests.
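
For reference, a rough sketch of such a helper, based on the calculation quoted above (the name getLowWatermarkAsBytes comes from the suggestion; the methods actually added in the PR may differ):

// Hypothetical helper on DiskThresholdSettings, sketched from the discussion above: converts the
// low watermark to bytes, whether it was configured as a percentage or as an absolute byte value.
public long getLowWatermarkAsBytes(long nodeTotalBytes) {
    if (getFreeDiskThresholdLow() != 0) {
        // percentage-based watermark: translate the percentage of total disk into bytes
        return (long) Math.ceil(nodeTotalBytes * getFreeDiskThresholdLow() * 0.01);
    }
    // byte-valued watermark: the parsed byte size is already available
    return getFreeBytesThresholdLow().getBytes();
}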

if (nodeAvailableBytes > freeBytesThresholdLow + 2 * indexPrimaryShardsStorageBytes -
shardsOnCurrentNodeStorageBytes) {
validRoutingNodes.add(node);
}
Member

I think we'll need some sort of signalling here when there isn't enough space, so potentially capturing the event where none of the otherwise valid nodes can hold all the shards, and changing the Exception thrown in that case to include that message.

Comment on lines 164 to 181
        List<Map.Entry<String, Long>> nodeShardsStorageList = new ArrayList<>(nodeShardsStorageBytes.entrySet());
        nodeShardsStorageList.sort((o1, o2) -> o2.getValue().compareTo(o1.getValue()));
        Optional<String> nodeId = Optional.empty();
        for (Map.Entry<String, Long> entry : nodeShardsStorageList) {
            // we prefer to select the node which contains the maximum shards storage bytes of the index from the valid node list
            if (validNodeIds.contains(entry.getKey())) {
                nodeId = Optional.of(entry.getKey());
                break;
            }
        }

        // if we cannot find a node which contains any shard of the index,
        // shuffle the valid node list and select randomly
        if (nodeId.isEmpty()) {
            List<String> list = new ArrayList<>(validNodeIds);
            Randomness.shuffle(list);
            nodeId = list.stream().findAny();
        }
Member

I think it'd be useful to factor this into a separate method and then make it unit testable, what do you think?

Contributor Author

That's a good idea, I've changed the code.

Comment on lines 159 to 160
listener.onFailure(new NoNodeAvailableException("could not find any nodes to allocate index [" +
indexName + "] onto prior to shrink"));
Member

This is the exception that we should enhance if there is a node that would normally be valid but failed because it doesn't have enough space to hold all the primary shards.

listener.onFailure(new NoNodeAvailableException("could not find any nodes to allocate index [" + indexName +
"] onto prior to shrink"));
}
}, listener::onFailure));
Member

This requires some consideration: for example, what happens if the nodes stats call times out, should we fail open (still try to find a node) or fail closed?

I think we should at least fill out the failure handler so that if the nodes stats call fails, the message in the exception that ILM explain will show is more human readable. Something like "failed to retrieve disk information to select a single node for primary shard allocation".

Contributor Author

Yeah, that makes sense, I've done that.
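
For example, the failure handler could wrap the nodes stats error roughly like this (a sketch; selectNodeAndAllocate is a hypothetical helper and the exact message may differ from the PR):

// Sketch: surface a human-readable message through ILM explain when the nodes stats call fails.
client.admin().cluster().nodesStats(nodesStatsRequest, ActionListener.wrap(
    nodesStatsResponse -> selectNodeAndAllocate(nodesStatsResponse, listener),   // hypothetical helper
    e -> listener.onFailure(new ElasticsearchException(
        "failed to retrieve disk information to select a single node for primary shard allocation", e))
));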

@gaobinlong
Contributor Author

@dakrone, sorry for the delay, I've pushed a new commit, could you take a look?

@dakrone dakrone self-requested a review October 25, 2021 17:13
@arteam arteam added v8.1.0 and removed v8.0.0 labels Jan 12, 2022
@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
@dakrone dakrone requested review from andreidan and removed request for dakrone February 16, 2022 22:23
@andreidan andreidan left a comment

Thanks for iterating on this @gaobinlong and apologies for the long delay in this PR.

I think this looks very good. I've left a few more suggestions (nothing major though).

}

if (validNodeIds.size() == 0) {
logger.debug("no nodes have enough disk space to hold one copy of the index [{}] onto prior to shrink ", indexName);
Contributor

I believe this error message is incorrect as we're including the size of the (future) shrunken index in the math as well? We should reflect that in the message, unless I'm misreading this.

Contributor Author

No, we only consider the size of the source index here.

if (nodeId.isEmpty()) {
List<String> list = new ArrayList<>(validNodeIds);
Randomness.shuffle(list);
nodeId = list.stream().findAny();
Contributor

since we shuffled above, shall we just get the first one? (to avoid extra allocations by streaming the list)

Contributor Author

Yeah, I think we should do that, I've changed the code.

@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
Labels
>bug
:Data Management/ILM+SLM Index and Snapshot lifecycle management
external-contributor Pull request authored by a developer outside the Elasticsearch team
Team:Data Management Meta label for data/management team
v8.15.0