Bookie Client add quarantine ratio when error count exceed threshold #2327

Merged
merged 1 commit into from Aug 13, 2020

hangc0276
Contributor

Motivation

When the bookie client reads data from or writes data to bookie servers, it checks the health of each connected server at a specific interval. Once the number of errors reaches the threshold, the bookie client quarantines that bookie server for several minutes (configured by bookieQuarantineTimeSeconds).

In most deployments, a large number of bookie clients (for example, Pulsar brokers) are connected to one bookie server. Once the bookie server runs under heavy load, most of the bookie clients receive errors and trigger quarantine at the same time, quarantining the server for several minutes. After those minutes pass, most bookie clients put the quarantined server back at the same time, which leads to periodic oscillation of the server's in/out throughput. This is an obstacle to tuning the throughput of the BookKeeper cluster.

Changes

I introduce a quarantine probability that determines whether a given client quarantines the server, so that most bookie clients do not quarantine a heavily loaded server at the same time.

I also expose the quarantine stats to Prometheus.
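
To make the idea concrete, here is a minimal sketch of a probabilistic quarantine decision. It is not the code in this patch: the class `QuarantineSketch`, its fields, and the method `shouldQuarantine` are hypothetical names, and it assumes the ratio is a probability in [0.0, 1.0] that is checked only after the error count has reached the threshold.

```java
import java.util.Random;

/** Hypothetical sketch of a probabilistic quarantine check; not the BookKeeper client code. */
final class QuarantineSketch {
    private final double quarantineRatio; // assumed: probability in [0.0, 1.0]
    private final long errorThreshold;    // assumed: error count that triggers the check
    private final Random random;

    QuarantineSketch(double quarantineRatio, long errorThreshold, Random random) {
        if (quarantineRatio < 0.0 || quarantineRatio > 1.0) {
            throw new IllegalArgumentException("quarantineRatio must be in [0.0, 1.0]");
        }
        this.quarantineRatio = quarantineRatio;
        this.errorThreshold = errorThreshold;
        this.random = random;
    }

    /** Returns true if this client should quarantine the bookie for the given error count. */
    boolean shouldQuarantine(long errorCount) {
        if (errorCount < errorThreshold) {
            return false; // threshold not reached: never quarantine
        }
        // nextDouble() is uniform in [0.0, 1.0), so a ratio of 1.0 always
        // quarantines and a ratio of 0.0 never does.
        return random.nextDouble() < quarantineRatio;
    }

    public static void main(String[] args) {
        int clients = 1000;
        int quarantined = 0;
        // Each simulated client decides independently, so only a fraction
        // of them quarantine the overloaded bookie in the same interval.
        for (int i = 0; i < clients; i++) {
            QuarantineSketch client = new QuarantineSketch(0.5, 100, new Random());
            if (client.shouldQuarantine(150)) {
                quarantined++;
            }
        }
        System.out.println(quarantined + " of " + clients + " clients quarantined the bookie");
    }
}
```

Because every client draws its own random number, the clients that do quarantine the server form a random subset rather than nearly all clients at once, which is what dampens the synchronized quarantine and release cycle described in the motivation.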

@hangc0276
Contributor Author

/pulsarbot run-failure-checks

@hangc0276
Contributor Author

rerun failure checks

@eolivelli
Contributor

Hi @hangc0276
are you running this patch in production?

@karanmehta93 @reddycharan @jvrao @merlimat @ravisharda @diegosalvi @sijie may be interested in this patch

@eolivelli
Contributor

@hangc0276 can you please add tests?

@hangc0276
Contributor Author

> Hi @hangc0276
> are you running this patch in production?
>
> @karanmehta93 @reddycharan @jvrao @merlimat @ravisharda @diegosalvi @sijie may be interested in this patch

Yes, I run this patch in production. With this patch, the standard deviation of the bookies' input throughput decreased from 75 MB to 40 MB.

@hangc0276
Contributor Author

> @hangc0276 can you please add tests?

OK, I will add the test case.

@eolivelli eolivelli requested a review from merlimat May 23, 2020 12:49
@hangc0276
Contributor Author

ping @sijie @jiazhai please take a look.

@sijie
Member

sijie commented May 30, 2020

@hangc0276 Nice contribution!

@sijie sijie added this to the 4.12.0 milestone Aug 13, 2020
@sijie sijie merged commit 7645cb8 into apache:master Aug 13, 2020
sijie pushed a commit that referenced this pull request Aug 13, 2020
Reviewers: Jia Zhai <zhaijia@apache.org>, Sijie Guo <None>

This closes #2327 from hangc0276/bookieClient_Quarantine_ratio

(cherry picked from commit 7645cb8)
Signed-off-by: Sijie Guo <sijie@apache.org>
Ghatage pushed a commit to Ghatage/bookkeeper that referenced this pull request Oct 6, 2020
Reviewers: Jia Zhai <zhaijia@apache.org>, Sijie Guo <None>

This closes apache#2327 from hangc0276/bookieClient_Quarantine_ratio
sijie pushed a commit to apache/pulsar that referenced this pull request Nov 13, 2020
…11.1 (#8546)

### Motivation
After the bookie client is upgraded to 4.11.1, it supports configuring the bookie client quarantine ratio.
This feature was introduced by apache/bookkeeper#2327.

### Changes
Add the `bookkeeperClientQuarantineRatio` configuration to broker.conf
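
For reference, a broker.conf entry for this setting might look like the fragment below. The value 0.5 is purely illustrative, not a recommendation, and the comment assumes the ratio is read as the probability (between 0 and 1) that a client quarantines a bookie once its error count has reached the threshold.

```properties
# Illustrative value only (assumption): roughly half of the clients that
# see the error threshold reached would quarantine the bookie.
bookkeeperClientQuarantineRatio=0.5
```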
flowchartsman pushed a commit to flowchartsman/pulsar that referenced this pull request Nov 17, 2020
…11.1 (apache#8546)
