Elasticsearch does not indicate retryability when flood stage is exceeded #49393

@jasontedor
Member

Today, if a node exceeds the disk flood stage watermark, the disk threshold monitor applies a special read-only index block to every index that has a shard allocated to that node. This block carries a forbidden status code, so any attempt to index into such an index returns an HTTP 403 status code to the client.

Clients assume that a 403 status code is not retryable and they drop data.

This situation is retryable though, as once the disk threshold monitor observes the free disk space go above the appropriate threshold, the index block is automatically removed.

Rather than expecting our clients to all account for this situation (by inspecting the specifics of the exception that led to the 403 status code), we should indicate retryability by using HTTP status code 429. While 429 is often translated as "too many requests", the HTTP specification is liberal about what this means:

Note that this specification does not define how the origin server identifies the user, nor how it counts requests. For example, an origin server that is limiting request rates can do so based upon counts of requests on a per-resource basis, across the entire server, or even among a set of servers.

By making this change, all of our clients can start retrying when faced with an index that was marked read-only due to a flood stage watermark exceeded event.
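
To illustrate what this would look like from the client side, here is a minimal sketch of a retry loop that treats 429 as retryable and anything else (including 403) as final. It is not taken from any official client: `is_retryable`, `index_with_retry`, the `send` callable, and the backoff parameters are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Tuple

# Assumption: this hypothetical client only treats 429 as transient.
RETRYABLE_STATUSES = {429}


def is_retryable(status: int) -> bool:
    """Return True if the same request may succeed when sent again later."""
    return status in RETRYABLE_STATUSES


def index_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call send() until it returns a non-retryable status or attempts run out.

    send() is a stand-in for one indexing request and returns the HTTP status.
    Returns (last_status, attempts_made).
    """
    status = send()
    attempts = 1
    while is_retryable(status) and attempts < max_attempts:
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempts - 1))
        status = send()
        attempts += 1
    return status, attempts
```

With a loop like this, a 403 from a flood-stage block ends after one attempt and the document is dropped, whereas a 429 keeps the document in flight until the block is lifted or the attempt budget is spent.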

Similarly, the status codes of other cluster blocks should be reexamined in this context.

Activity

added the :Distributed Indexing/CRUD label on Nov 20, 2019

elasticmachine commented on Nov 20, 2019

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CRUD)

gaobinlong commented on Dec 2, 2019

@gaobinlong
Contributor

Hi @jasontedor, I'm interested in this issue. Should we return a 429 status code if the cluster block is set manually rather than automatically when the flood stage is exceeded?

jasontedor commented on Dec 10, 2019

@jasontedor
Member, Author

@gaobinlong I think it's fine to treat them the same. I wish we had an easy way to distinguish when it's automatically set versus when it's not, but we don't really, so let's proceed to treat them as the same.

gaobinlong commented on Dec 10, 2019

@gaobinlong
Contributor

@jasontedor ok, I got it.

gaobinlong commented on Dec 13, 2019

@gaobinlong
Contributor

Hi @jasontedor, I have made a PR for this issue. Can you help review the code change?

added the Team:Distributed (Obsolete) label on May 4, 2020

zez3 commented on Mar 27, 2021

@zez3

#50166

This PR, valid from 7.7 onwards, has been brought to my attention.

DaveCTurner commented on Jul 30, 2021

@DaveCTurner
Contributor

Closed by #50166.

Metadata

Assignees

No one assigned

    Labels

    :Distributed Indexing/CRUD, >bug, Team:Distributed (Obsolete), help wanted, adoptme

        Participants

        @rjernst, @ywelsch, @jasontedor, @DaveCTurner, @gaobinlong
