
Optimize single node selection in Shrink Action of ILM #76206

Open
gaobinlong wants to merge 26 commits into main

Conversation

gaobinlong
Contributor

Relates to #67957.

The main changes of this PR are:

  1. Optimize node selection in SetSingleNodeAllocateStep of ILM's Shrink Action. The selection works as follows (a rough sketch of this logic is shown after this list):
    (1) Get node stats, including only fs info and index store stats.
    (2) Calculate each node's shard storage bytes for the source index.
    (3) Sum all nodes' shard storage bytes to get the storage bytes of the source index's primary shards.
    (4) Select the nodes that can hold two copies of the index's primary shards. If the file system doesn't support hard-linking, all segments are copied into the new shrunken index, so the node's free bytes must stay above the low watermark to make sure the new shrunken index can be initialized successfully.
    (5) From the nodes selected in step (4), pick the node that holds the maximum shard storage bytes of the source index, because we want to reduce data transfer cost as much as possible.
    (6) If we cannot find a node that contains any shard of the source index, shuffle the valid node list and select a node randomly.

  2. Add some test methods for the changes above.
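
For illustration, steps (2)–(6) above correspond roughly to the following sketch; the node stats gathering of step (1) is assumed to have already produced the two input maps. All names here (selectNode, sourceIndexBytesPerNode, freeBytesPerNode, lowWatermarkBytes) are illustrative placeholders, not the actual SetSingleNodeAllocateStep code:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of the node-selection steps above, not the PR's actual implementation.
class ShrinkNodeSelectionSketch {

    static Optional<String> selectNode(Map<String, Long> sourceIndexBytesPerNode, // step (2): source-index shard bytes per node
                                       Map<String, Long> freeBytesPerNode,        // free disk bytes per node, from node stats
                                       long lowWatermarkBytes) {                  // low watermark expressed in bytes
        // step (3): total storage used by the source index's primary shards
        long primaryShardsBytes = sourceIndexBytesPerNode.values().stream().mapToLong(Long::longValue).sum();

        // step (4): keep nodes that can hold two copies of the primaries while staying above the low watermark
        List<String> validNodeIds = new ArrayList<>();
        for (Map.Entry<String, Long> node : freeBytesPerNode.entrySet()) {
            long bytesAlreadyOnNode = sourceIndexBytesPerNode.getOrDefault(node.getKey(), 0L);
            if (node.getValue() > lowWatermarkBytes + 2 * primaryShardsBytes - bytesAlreadyOnNode) {
                validNodeIds.add(node.getKey());
            }
        }
        if (validNodeIds.isEmpty()) {
            return Optional.empty();
        }

        // step (5): prefer the valid node that already holds the most data of the source index
        Optional<String> best = validNodeIds.stream()
            .filter(id -> sourceIndexBytesPerNode.getOrDefault(id, 0L) > 0)
            .max(Comparator.comparingLong((String id) -> sourceIndexBytesPerNode.get(id)));
        if (best.isPresent()) {
            return best;
        }

        // step (6): otherwise pick any valid node at random
        Collections.shuffle(validNodeIds);
        return Optional.of(validNodeIds.get(0));
    }
}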

@elasticsearchmachine elasticsearchmachine added v8.0.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Aug 6, 2021
@dakrone dakrone added :Data Management/ILM+SLM Index and Snapshot lifecycle management team-discuss labels Aug 12, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Aug 12, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@dakrone dakrone self-requested a review August 19, 2021 15:49
@dakrone
Member

dakrone commented Aug 19, 2021

Thanks for opening this @gaobinlong, I think it's a good improvement for selecting a node. I will take a look and leave some review comments.

@dakrone dakrone left a comment

Thanks @gaobinlong, I left a number of comments on this.

Comment on lines 112 to 113
Arrays.stream(indexShardStats.getShards()).mapToLong(shardStats ->
shardStats.getStats().getStore().getSizeInBytes()).sum()).sum();
Member

Rather than do this for all shards, we can do something similar to:

indexShardStats.getPrimary().getStore().getSizeInBytes()

And then avoid the division below by the number of replicas (because we really only care about the size of the primary shards anyway):

Suggested change
- Arrays.stream(indexShardStats.getShards()).mapToLong(shardStats ->
-     shardStats.getStats().getStore().getSizeInBytes()).sum()).sum();
+ indexShardStats.getPrimary().getStore().getSizeInBytes()).sum();

Member

Also, getStore() is @Nullable, so there should be protection added to ensure it doesn't throw an NPE when it's null
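
For illustration, a null-safe version of the summation might look roughly like this (a sketch only, reusing the accessor names quoted above rather than the PR's actual change):

// Sketch: skip shards whose store stats are null before summing, since getStore() is @Nullable.
long storageBytes = Arrays.stream(indexShardStats.getShards())
    .map(shardStats -> shardStats.getStats().getStore())
    .filter(Objects::nonNull)                  // java.util.Objects
    .mapToLong(store -> store.getSizeInBytes())
    .sum();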

Contributor Author

I found that when we add the index.routing.allocation.require._id setting to the index, either the primary shard or a replica shard can be relocated to the selected node, so we should account for the size of both primary and replica shards.

Comment on lines +120 to +122
if (indexMetadata.getNumberOfReplicas() != 0) {
indexPrimaryShardsStorageBytes /= indexMetadata.getNumberOfReplicas();
}
Member

This can be removed if we use the primary stats above

Contributor

The calculation above is used to indicate "how much of the source index's storage" is available per node. As shrink works from both primary and replica shards, I believe it's correct to include replicas in the math.

Comment on lines 129 to 134
if (diskThresholdSettings.getFreeDiskThresholdLow() != 0) {
freeBytesThresholdLow = (long) Math.ceil(nodeTotalBytes *
diskThresholdSettings.getFreeDiskThresholdLow() * 0.01);
} else {
freeBytesThresholdLow = diskThresholdSettings.getFreeBytesThresholdLow().getBytes();
}
Member

I'm not sure why you're doing this here: when constructing the DiskThresholdSettings the bytes are always calculated (see the setLowWatermark(...) call in the constructor), so I think you can always use getFreeBytesThresholdLow() to get the number of bytes rather than doing a calculation with the percentage?

Contributor Author

In my testing, diskThresholdSettings.getFreeBytesThresholdLow() returns 0 if we set the low watermark to a percentage, and the converse is also true.

Contributor
@andreidan andreidan Feb 17, 2022

Gosh, so this is very confusing. @gaobinlong you're right about getFreeBytesThresholdLow returning 0 when the watermark is configured using percentages. Like Lee, I've also been tripped up by the setLowWatermark method in DiskThresholdSettings.

For another PR - we should rename the methods used in setLowWatermark to reflect that they only maybe return something, i.e. thresholdPercentageFromWatermark -> thresholdPercentageFromWatermarkIfPercentageConfigured or maybeThresholdPercentageFromWatermark.
Similarly for thresholdBytesFromWatermark.

Contributor

Can we extract this math into a method in DiskThresholdSettings? i.e. getLowWatermarkAsBytes or something similarly named? (with corresponding unit tests)

Contributor Author
@gaobinlong gaobinlong Apr 4, 2022

@andreidan, thanks for the suggestion, I've added some public methods in DiskThresholdSettings and added some unit tests.
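
For reference, a rough sketch of such a helper, based on the calculation quoted above (the name getLowWatermarkAsBytes comes from the suggestion; the methods actually added in the PR may differ):

// Hypothetical helper on DiskThresholdSettings, sketched from the discussion above: converts the
// low watermark to bytes, whether it was configured as a percentage or as an absolute byte value.
public long getLowWatermarkAsBytes(long nodeTotalBytes) {
    if (getFreeDiskThresholdLow() != 0) {
        // percentage-based watermark: translate the percentage of total disk into bytes
        return (long) Math.ceil(nodeTotalBytes * getFreeDiskThresholdLow() * 0.01);
    }
    // byte-valued watermark: the parsed byte size is already available
    return getFreeBytesThresholdLow().getBytes();
}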

if (nodeAvailableBytes > freeBytesThresholdLow + 2 * indexPrimaryShardsStorageBytes -
shardsOnCurrentNodeStorageBytes) {
validRoutingNodes.add(node);
}
Member

I think we'll need some sort of signalling here when there isn't enough space, so potentially capturing the event where none of the otherwise valid nodes can hold all the shards, and changing the Exception thrown in that case to include that message.

Comment on lines 164 to 181
        List<Map.Entry<String, Long>> nodeShardsStorageList = new ArrayList<>(nodeShardsStorageBytes.entrySet());
        nodeShardsStorageList.sort((o1, o2) -> o2.getValue().compareTo(o1.getValue()));
        Optional<String> nodeId = Optional.empty();
        for (Map.Entry<String, Long> entry : nodeShardsStorageList) {
            // we prefer to select the node which contains the maximum shards storage bytes of the index from the valid node list
            if (validNodeIds.contains(entry.getKey())) {
                nodeId = Optional.of(entry.getKey());
                break;
            }
        }

        // if we cannot find a node which contains any shard of the index,
        // shuffle the valid node list and select randomly
        if (nodeId.isEmpty()) {
            List<String> list = new ArrayList<>(validNodeIds);
            Randomness.shuffle(list);
            nodeId = list.stream().findAny();
        }
Member

I think it'd be useful to factor this into a separate method and then make it unit testable, what do you think?

Contributor Author

That's a good idea, I've changed the code.

Comment on lines 159 to 160
listener.onFailure(new NoNodeAvailableException("could not find any nodes to allocate index [" +
indexName + "] onto prior to shrink"));
Member

This is the exception that we should enhance if there is a node that would normally be valid but failed because it doesn't have enough space to hold all the primary shards.

listener.onFailure(new NoNodeAvailableException("could not find any nodes to allocate index [" + indexName +
"] onto prior to shrink"));
}
}, listener::onFailure));
Member

This requires some consideration: for example, what happens if the nodes stats call times out, should we fail open (still try to find a node) or fail closed?

I think we should at least fill out the failure handler so that if the nodes stats call fails, the message in the exception that ILM explain will show is more human readable. Something like "failed to retrieve disk information to select a single node for primary shard allocation".

Contributor Author

Yeah, that makes sense, I've done that.
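
For example, the failure handler could wrap the nodes stats error roughly like this (a sketch; selectNodeAndAllocate is a hypothetical helper and the exact message may differ from the PR):

// Sketch: surface a human-readable message through ILM explain when the nodes stats call fails.
client.admin().cluster().nodesStats(nodesStatsRequest, ActionListener.wrap(
    nodesStatsResponse -> selectNodeAndAllocate(nodesStatsResponse, listener),   // hypothetical helper
    e -> listener.onFailure(new ElasticsearchException(
        "failed to retrieve disk information to select a single node for primary shard allocation", e))
));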

@gaobinlong
Contributor Author

@dakrone, sorry for the delay, I've pushed a new commit, could you take a look?

@dakrone dakrone self-requested a review October 25, 2021 17:13
@arteam arteam added v8.1.0 and removed v8.0.0 labels Jan 12, 2022
@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
@dakrone dakrone requested review from andreidan and removed request for dakrone February 16, 2022 22:23
@andreidan andreidan left a comment

Thanks for iterating on this @gaobinlong and apologies for the long delay in this PR.

I think this looks very good. I've left a few more suggestions (nothing major though).

}

if (validNodeIds.size() == 0) {
logger.debug("no nodes have enough disk space to hold one copy of the index [{}] onto prior to shrink ", indexName);
Contributor

I believe this error message is incorrect as we're including the size of the (future) shrunken index in the math as well? We should reflect that in the message, unless I'm misreading this.

Contributor Author

No, we only consider the size of the source index here.

if (nodeId.isEmpty()) {
List<String> list = new ArrayList<>(validNodeIds);
Randomness.shuffle(list);
nodeId = list.stream().findAny();
Contributor

since we shuffled above, shall we just get the first one? (to avoid extra allocations by streaming the list)

Contributor Author

Yeah, I think we should do that, I've changed the code.

@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@csoulios csoulios added v8.6.0 and removed v8.5.0 labels Sep 21, 2022
Labels
>bug
:Data Management/ILM+SLM Index and Snapshot lifecycle management
external-contributor Pull request authored by a developer outside the Elasticsearch team
Team:Data Management Meta label for data/management team
v8.15.0