
Return 429 status code when there's a read_only cluster block #50166

Merged (17 commits into elastic:master on Feb 22, 2020)

Conversation

@gaobinlong (Contributor) commented Dec 13, 2019

This PR is related to #49393.
The main points of the change are:

  1. Change the status code from 403 to 429 when there is a cluster- or index-level block (a rough sketch follows below).
  2. Update the error message of ClusterBlockException.
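
As a rough illustration of point 1 (not the literal diff in this PR): in Elasticsearch, an exception maps itself to an HTTP status by overriding ElasticsearchException.status(), so the change amounts to returning TOO_MANY_REQUESTS where FORBIDDEN used to be returned:

    // Illustrative fragment only, assuming a subclass of ElasticsearchException
    // such as ClusterBlockException; not the literal code from this PR.
    @Override
    public RestStatus status() {
        return RestStatus.TOO_MANY_REQUESTS; // 429, previously RestStatus.FORBIDDEN (403)
    }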

@jimczi added the :Distributed/CRUD and >bug labels on Dec 13, 2019
@elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (:Distributed/CRUD)

@jasontedor (Member) left a comment

Thanks for picking this up @gaobinlong. In the issue #49393 I did say:

Similarly, the status codes of other cluster blocks should be reexamined in this context.

However, I think this change took it too far. For example, if a write block is placed on an index manually by a user to prevent any further writes to the index, is that really a retryable situation? I would argue not.

Can you take another stab at thinking through which other blocks should return 429?

@gaobinlong changed the title from "Return 429 status code when there's a cluster or index block" to "Return 429 status code when there's a read_only cluster block" on Dec 27, 2019
@gaobinlong (Contributor, Author) commented

@jasontedor, I have changed the code so that the status code is only changed to 429 when there is a read_only block:

  • index.read_only
  • index.read_only_allow_delete
  • cluster.read_only
  • cluster.read_only_allow_delete

These blocks have the METADATA_WRITE and WRITE block levels; I think they can return 429 so that users can retry their write operations and avoid data loss (see the sketch below).
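
For reference, a minimal sketch of how such a block carries its status and levels, modeled on the index-level read_only_allow_delete block (id 12 and the description match the block messages quoted later in this thread; the constructor argument order is paraphrased, so check it against IndexMetadata):

    import java.util.EnumSet;

    import org.elasticsearch.cluster.block.ClusterBlock;
    import org.elasticsearch.cluster.block.ClusterBlockLevel;
    import org.elasticsearch.rest.RestStatus;

    // Sketch: each ClusterBlock declares the REST status it maps to and the
    // block levels it applies to; switching FORBIDDEN to TOO_MANY_REQUESTS
    // here is the essence of the change being discussed.
    public static final ClusterBlock INDEX_READ_ONLY_ALLOW_DELETE_BLOCK =
        new ClusterBlock(12, "index read-only / allow delete (api)",
            false,                        // retryable
            false,                        // disableStatePersistence
            true,                         // allowReleaseResources
            RestStatus.TOO_MANY_REQUESTS, // previously RestStatus.FORBIDDEN
            EnumSet.of(ClusterBlockLevel.METADATA_WRITE, ClusterBlockLevel.WRITE));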

@jasontedor (Member) commented

I’m on vacation, can someone on @elastic/es-distributed review this while I’m out?

@henningandersen (Contributor) left a comment

Thanks for the extra work on this @gaobinlong. First of all, please do not force-push to PRs: it makes it hard to review the changes made since last time, and comments lose their meaning when we cannot see what the original version of the code was.

I think that of the 4 blocks modified here, we should only return TOO_MANY_REQUESTS for the one block that the disk threshold monitor clears automatically (i.e., the index-level index.blocks.read_only_allow_delete). I worry that the other cases could have legitimate user-facing use cases (for instance, manually marking an index read-only) that this change would break; see the sketch below.

Will you look into amending the PR (no force push, please)?
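
A self-contained sketch of the distinction drawn above, with the settings spelled out as plain strings (hypothetical helper, not code from the PR):

    // Hypothetical illustration: only the block that the DiskThresholdMonitor
    // releases automatically is retryable, so only it maps to 429; manually
    // applied blocks stay until a user removes them, so 403 remains apt.
    static int statusFor(String blockSetting) {
        if ("index.blocks.read_only_allow_delete".equals(blockSetting)) {
            return 429; // cleared automatically once disk usage falls again
        }
        return 403; // e.g. index.blocks.read_only, applied deliberately by a user
    }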

@gaobinlong (Contributor, Author) commented

@henningandersen Thanks for your comments and advice. I have pushed a new commit that only returns TOO_MANY_REQUESTS for the index-level read_only_allow_delete block. Could you help review the change?

@henningandersen (Contributor) left a comment

Thanks, looks good. I left two smaller things to address. I will start a CI job for this too.

@henningandersen (Contributor) commented

@elasticmachine test this please

@henningandersen (Contributor) left a comment

Thanks @gaobinlong. I added a few more comments; as far as I can see, the new logic to prefer 403 over 429 will not work, please have a look. (A sketch of the intended resolution follows.)
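
For readers following along, the "prefer 403 over 429" resolution under discussion amounts to something like the following (paraphrased, not quoted from the PR):

    import java.util.Collection;

    import org.elasticsearch.cluster.block.ClusterBlock;
    import org.elasticsearch.rest.RestStatus;

    // Paraphrased sketch: when a request trips several blocks at once, a
    // non-retryable 403 block must win over a retryable 429 one, otherwise
    // clients would keep retrying an operation that cannot succeed.
    static RestStatus resolveStatus(Collection<ClusterBlock> blocks) {
        for (ClusterBlock block : blocks) {
            if (block.status() == RestStatus.FORBIDDEN) {
                return RestStatus.FORBIDDEN; // any 403 block masks the 429 ones
            }
        }
        return RestStatus.TOO_MANY_REQUESTS;
    }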

@gaobinlong (Contributor, Author) commented

Hi @henningandersen, could you take some time to review the new commit I have pushed? Thanks a lot.

@henningandersen (Contributor) left a comment

Thanks @gaobinlong for the extra iteration. I left a few smaller comments, but otherwise it is looking good. I will kick off a test run too.

@henningandersen (Contributor) commented

@elasticmachine test this please

@henningandersen (Contributor) commented

@gaobinlong the test failures look like they could be caused by incompatibilities between this branch and the latest changes in master/7.x. Will you merge the master changes into this branch?

@gaobinlong (Contributor, Author) commented

Hi @henningandersen, I have merged the master changes into the branch and optimized some code following your advice.

@henningandersen (Contributor) commented

@elasticmachine test this please

@henningandersen (Contributor) commented Jan 24, 2020

@gaobinlong, nearly there, but unfortunately the branch is again out of date with master/7.x, causing the build to fail. I can merge in the changes unless you prefer to do so yourself?

@gaobinlong (Contributor, Author) commented

@henningandersen, thanks a lot if you can help merge in the master changes.

@henningandersen (Contributor) commented

@elasticmachine update branch

@henningandersen (Contributor) left a comment

Nice catch, @gaobinlong. I added a couple of comments. Thanks again for your efforts.

@gaobinlong (Contributor, Author) commented

@henningandersen, the new commit I have pushed contains some changes following your comments. The difference is that instead of updating the setting cluster.info.update.interval, the cluster info is refreshed manually so that the test runs faster, because that setting's minimum value is 10s (see the sketch below). Could you have a look at the changes?
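
For context, refreshing the cluster info by hand from an internal-cluster test looks roughly like this (a common pattern in ESIntegTestCase-based tests; it assumes InternalClusterInfoService.refresh() is accessible in the version at hand, so treat it as a sketch):

    import org.elasticsearch.cluster.ClusterInfoService;
    import org.elasticsearch.cluster.InternalClusterInfoService;

    // Inside an ESIntegTestCase: trigger a synchronous cluster-info refresh on
    // the master node instead of waiting for cluster.info.update.interval,
    // whose minimum value is 10s.
    InternalClusterInfoService infoService = (InternalClusterInfoService) internalCluster()
        .getInstance(ClusterInfoService.class, internalCluster().getMasterName());
    infoService.refresh();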

@henningandersen (Contributor) left a comment

Sorry for the delay here. I have one comment left and then I think we are good.

@gaobinlong (Contributor, Author) commented

@henningandersen, I have pushed a new commit which delays refreshing the cluster info using ThreadPool.schedule(). I can see that the retry handler is triggered when the disk allocation decider is enabled (see the sketch below). Could you help review the code changes?
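
A minimal sketch of that scheduling call, assuming the 7.x signature ThreadPool.schedule(Runnable, TimeValue, String) and reusing the infoService handle from the sketch above:

    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.threadpool.ThreadPool;

    // Delay the cluster-info refresh instead of running it inline, so the
    // write and its retry can be observed racing the release of the block.
    threadPool.schedule(
        () -> infoService.refresh(),      // refresh hook from the earlier sketch
        TimeValue.timeValueMillis(100),   // illustrative delay
        ThreadPool.Names.GENERIC);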

@henningandersen (Contributor) commented

@elasticmachine test this please

@henningandersen (Contributor) commented

@gaobinlong there are a few checkstyle issues reported by the build, see:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/16094/console

Notice that you can run:

./gradlew precommit

to run all the precommit checks before pushing.

Would you mind looking into those?

@gaobinlong (Contributor, Author) commented

@henningandersen, I have done that and merged the master changes into the branch.

@henningandersen (Contributor) commented

@elasticmachine test this please

@henningandersen (Contributor) left a comment

LGTM, thanks for the additional iterations @gaobinlong.

@henningandersen merged commit 36bd666 into elastic:master on Feb 22, 2020
henningandersen pushed a commit to henningandersen/elasticsearch that referenced this pull request Feb 22, 2020
…#50166)

We consider index level read_only_allow_delete blocks temporary since
the DiskThresholdMonitor can automatically release those when an index
is no longer allocated on nodes above high threshold.

The rest status has therefore been changed to 429 when encountering this
index block to signal retryability to clients.

Related to elastic#49393
henningandersen pushed a commit that referenced this pull request Feb 22, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 23, 2020
Users are perennially confused by the message they get when writing to
an index is blocked due to excessive disk usage:

    TOO_MANY_REQUESTS/12/index read-only / allow delete (api)

Of course this is technically accurate but it is hard to join the dots
from this message to "your disk was too full" without some searching of
forums and documentation. Additionally in elastic#50166 we changed the status
code to today's `429` from the previous `403` which changed the message
from the one that's widely documented elsewhere:

    FORBIDDEN/12/index read-only / allow delete (api)

Since elastic#42559 we've considered this block to be under the sole control of
the disk-based shard allocator, and we have seen no evidence to suggest
that anyone is applying this block manually. Therefore this commit
adjusts this block's message to indicate that it's caused by a lack of
disk space.
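
For reference, after that adjustment the block message reads along these lines (wording approximate; check the exact string in the release you run):

    TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block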
DaveCTurner added a commit that referenced this pull request Jun 24, 2020
DaveCTurner added a commit that referenced this pull request Jun 24, 2020
Labels
>bug, :Distributed/CRUD, v7.7.0, v8.0.0-alpha1