Bookie Client add quarantine ratio when error count exceed threshold #2327
… the bookie server in the same time
/pulsarbot run-failure-checks
rerun failure checks
Hi @hangc0276 @karanmehta93 @reddycharan @jvrao @merlimat @ravisharda @diegosalvi @sijie, you may be interested in this patch.
@hangc0276 can you please add tests?
Yes, I am running this patch in production. With it, the standard deviation of the bookies' input throughput decreased from 75MB to 40MB.
OK, I will add the test case.
@hangc0276 Nice contribution!
### Motivation
When a bookie client reads data from or writes data to bookie servers, it checks the health of each connected server at a specific interval. Once the number of errors reaches the threshold, the bookie client quarantines the bookie server for several minutes (configured by `bookieQuarantineTimeSeconds`). In most circumstances, a large number of bookie clients, for example Pulsar brokers, are connected to one bookie server. Once the bookie server is under heavy load, most of the bookie clients receive errors and trigger quarantine at the same time, and then quarantine the server for several minutes. After those minutes have passed, most bookie clients put the quarantined server back at the same time, which leads to periodic oscillation of the server's in/out throughput. This is an obstacle to tuning the throughput of the BookKeeper cluster.
### Changes
I introduce a quarantine probability that determines whether a given client quarantines the server, so that most bookie clients do not quarantine a heavily loaded server at the same time. I also expose the quarantine stats to Prometheus.
Reviewers: Jia Zhai <zhaijia@apache.org>, Sijie Guo <None>
This closes #2327 from hangc0276/bookieClient_Quarantine_ratio
(cherry picked from commit 7645cb8)
Signed-off-by: Sijie Guo <sijie@apache.org>
…11.1 (#8546)
### Motivation
After the bookie client is upgraded to 4.11.1, it supports configuring the bookie client quarantine ratio. This feature was introduced by apache/bookkeeper#2327.
### Changes
Add the `bookkeeperClientQuarantineRatio` configuration to broker.conf.
Motivation
When a bookie client reads data from or writes data to bookie servers, it checks the health of each connected server at a specific interval. Once the number of errors reaches the threshold, the bookie client quarantines the bookie server for several minutes (configured by `bookieQuarantineTimeSeconds`).
In most circumstances, a large number of bookie clients, for example Pulsar brokers, are connected to one bookie server. Once the bookie server is under heavy load, most of the bookie clients receive errors and trigger quarantine at the same time, and then quarantine the server for several minutes. After those minutes have passed, most bookie clients put the quarantined server back at the same time, which leads to periodic oscillation of the server's in/out throughput. This is an obstacle to tuning the throughput of the BookKeeper cluster.
Changes
I introduce a quarantine probability that determines whether a given client quarantines the server, so that most bookie clients do not quarantine a heavily loaded server at the same time.
I also expose the quarantine stats to Prometheus.
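To make the idea of a quarantine probability concrete, here is a minimal sketch in Python. It is not the code from this patch. The names `should_quarantine`, `count_quarantining_clients`, `error_threshold` and `quarantine_ratio` are hypothetical, and the sketch assumes each client decides independently with a uniform random draw once its error count has reached the threshold.

```python
import random


def should_quarantine(error_count, error_threshold, quarantine_ratio, rng=random):
    """Decide whether one client quarantines a bookie (illustrative only).

    All parameter names are assumptions made for this sketch; they are not
    BookKeeper's actual fields or configuration keys.
    """
    # Below the error threshold the bookie is still considered healthy.
    if error_count < error_threshold:
        return False
    # rng.random() is uniform in [0.0, 1.0), so a ratio of 1.0 always
    # quarantines and a ratio of 0.0 never does.
    return rng.random() < quarantine_ratio


def count_quarantining_clients(num_clients, quarantine_ratio, seed=0):
    """Count how many of num_clients quarantine the same overloaded bookie.

    Every simulated client is assumed to have already exceeded the threshold
    (5 errors against a threshold of 3).
    """
    rng = random.Random(seed)
    return sum(
        should_quarantine(5, 3, quarantine_ratio, rng=rng)
        for _ in range(num_clients)
    )


if __name__ == "__main__":
    for ratio in (1.0, 0.5, 0.1):
        hits = count_quarantining_clients(1000, ratio)
        print(f"ratio={ratio}: {hits} of 1000 clients quarantine the bookie")
```

With a ratio of 1.0, every client that sees enough errors quarantines the server at once, which is the synchronized behaviour described in the motivation. A lower ratio lets only a fraction of the clients back off in any one check, so load on the server drops gradually instead of all at once.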