Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation of the number_of_shards parameter in Shrink Action of ILM #74219

Merged
merged 14 commits into from Aug 16, 2021

Conversation

gaobinlong
Copy link
Contributor

Relates to #72724.

The main changes of the PR are:

  1. Add validation of the number_of_shards parameter in the SetSingleNodeAllocateStep, fail the step if number_of_shards is not a factor of the source index's shards count.
  2. Add some test code for the change.

@elasticsearchmachine elasticsearchmachine added the external-contributor Pull request authored by a developer outside the Elasticsearch team label Jun 17, 2021
@mark-vieira mark-vieira added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Jun 25, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jun 25, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@andreidan andreidan self-requested a review July 1, 2021 15:35
if (numberOfShards != null) {
int sourceNumberOfShards = indexMetadata.getNumberOfShards();
int factor = sourceNumberOfShards / numberOfShards;
if (factor * numberOfShards < sourceNumberOfShards) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, is there any particular reason you didn't go with if (sourceNumberOfShards % numberOfShards != 0) {?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a habit, % is better, I think.

@Override
public void performAction(IndexMetadata indexMetadata, ClusterState clusterState,
ClusterStateObserver observer, ActionListener<Boolean> listener) {
String indexName = indexMetadata.getIndex().getName();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about refactoring all of this into some new Step? I like the logic overall, and I agree that we should do the validation before the single allocation starts, I'm just not sure it belongs within this step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a good idea, I will refactor the code soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @joegallo , I've refactor the code yet, can you take a look at the new commit?

@andreidan
Copy link
Contributor

@elasticmachine ok to test

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this @gaobinlong

I've left a few suggestions (and I suspect there'll be a few ITs failures) but this is looking good

@gaobinlong
Copy link
Contributor Author

@andreidan, the issues you mentioned above have been resolved now, I noticed the test failures and I'm still working on fixing them.

@gaobinlong
Copy link
Contributor Author

@andreidan , I have no idea about how to fix the test failure in ClusterStateWaitThresholdBreachTests#testWaitInShrunkShardsAllocatedExceedsThreshold:

// ILM will retry the shrink step because the number of shards to shrink to is gt the current number of shards

, as we add a CheckTargetShardsCountStep before the CheckShrinkReadyStep and ShrunkShardsAllocatedStep, the Shrink Action will stuck in the CheckTargetShardsCountStep so the ClusterStateWaitUntilThresholdStep defined in ShrinkAction#toSteps will not execute:

ClusterStateWaitUntilThresholdStep allocated = new ClusterStateWaitUntilThresholdStep(

, maybe we can fail the ShrinkStep by triggering the shard's max doc count limit when executing the ResizeAction:

throw new IllegalStateException("Can't merge index with more than [" + IndexWriter.MAX_DOCS

, but I don't know how to mock the DocsStats.
Can you give me some cue to fix the failure?

@andreidan
Copy link
Contributor

@gaobinlong Great point. I've opened #75695 to make this particular test rely on the shrunk index not being able to allocate (as opposed to the managed index not being able to shrink successfully as it was before).

If you merge the main branch into your branch after #75695 is merged things should start working for this particular test (there might be others that use the same strategy for failing the shrink action - ie. higher configured number of shards - so they'll need a similar change to ClusterStateWaitThresholdBreachTests.

@gaobinlong
Copy link
Contributor Author

@andreidan thanks for your help, I've fixed other thest failures except the ShrinkActionIT#testShrinkStepMovesForwardIfShrunkIndexIsCreatedBetweenRetries:

public void testShrinkStepMovesForwardIfShrunkIndexIsCreatedBetweenRetries() throws Exception {

, as it's hard to make the ShrinkStep failed and ShrinkAction will stuck in the new added CheckTargetShardsCountStep if the target shards count is higher than the source index's shards count, so I delete the test method directly. Can you help to take a look at the new commits?

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this @gaobinlong

I think this is nearly complete.

I agree testShrinkStepMovesForwardIfShrunkIndexIsCreatedBetweenRetries is not feasible anymore, however can you please create a unit test in ShrinkStepTests instead? (basically have a clusterState that already contains the shruken index and assert the success listener method is called once by performAction)

@gaobinlong
Copy link
Contributor Author

@andreidan,I have done that, can you help to take a look again?

@dakrone
Copy link
Member

dakrone commented Aug 9, 2021

@joegallo as @andreidan is on vacation this week, could you take another peek at this PR?

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for iterating on this @gaobinlong

@andreidan andreidan added the auto-backport Automatically create backport pull requests when merged label Aug 16, 2021
@andreidan andreidan merged commit 58feb4e into elastic:master Aug 16, 2021
andreidan pushed a commit to andreidan/elasticsearch that referenced this pull request Aug 16, 2021
…n of ILM (elastic#74219)

Add validation of the number_of_shards parameter in Shrink Action of ILM

(cherry picked from commit 58feb4e)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
elasticsearchmachine pushed a commit that referenced this pull request Aug 16, 2021
…n of ILM (#74219) (#76565)

Add validation of the number_of_shards parameter in Shrink Action of ILM

(cherry picked from commit 58feb4e)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>

Co-authored-by: bellengao <gbl_long@163.com>
wjp719 added a commit to wjp719/elasticsearch that referenced this pull request Aug 17, 2021
* master: (868 commits)
  Query API key - Rest spec and yaml tests (elastic#76238)
  Delay shard reassignment from nodes which are known to be restarting (elastic#75606)
  Reenable bwc tests for elastic#76475 (elastic#76576)
  Set version to 7.15 in BWC code (elastic#76577)
  Don't remove warning headers on all failure (elastic#76434)
  Disable bwc tests for elastic#76475 (elastic#76541)
  Re-enable bwc tests (elastic#76567)
  Keep track of data recovered from snapshots in RecoveryState (elastic#76499)
  [Transform] Align transform checkpoint range with date_histogram interval for better performance (elastic#74004)
  EQL: Remove "wildcard" function (elastic#76099)
  Fix 'accept' and 'content_type' fields for search_mvt API
  Add persistent licensed feature tracking (elastic#76476)
  Add system data streams to feature state snapshots (elastic#75902)
  fix the error message for instance methods that don't exist (elastic#76512)
  ILM: Add validation of the number_of_shards parameter in Shrink Action of ILM (elastic#74219)
  remove dashboard only reserved role (elastic#76507)
  Fix Stack Overflow in UnassignedInfo in Corner Case (elastic#76480)
  Add (Extended)KeyUsage KeyUsage, CipherSuite & Protocol to SSL diagnostics (elastic#65634)
  Add recovery from snapshot to tests (elastic#76535)
  Reenable BwC Tests after elastic#76532 (elastic#76534)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged :Data Management/ILM+SLM Index and Snapshot lifecycle management >enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team Team:Data Management Meta label for data/management team v7.15.0 v8.0.0-alpha2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants