Adaptive capacity management (RateLimiter + BulkHead) decorator #201

Open
RobWin opened this issue Feb 6, 2018 · 36 comments

Comments

@RobWin
Member

RobWin commented Feb 6, 2018

See concept: https://www.youtube.com/watch?v=m64SWl9bfvk

@storozhukBM
Member

Implementation thoughts.

As Jon Moore mentioned in his talk, we can use increased latency as an implicit signal for adaptive concurrency adjustments.

Adaptive capacity management prerequisites

  1. You should have a system with relatively stable response latency, because we will use latency measures to adapt concurrency limits.
  2. To configure your system properly you should figure out three things:
    2.1 Desirable average throughput per second (later X) [Example: 30 req/sec]
    2.2 Desirable average request latency in seconds per operation (later R) [Example: 0.1 sec/op]
    2.3 Maximum acceptable request latency (later Rmax). Set this number carefully, because a poor choice can eliminate all adaptive capabilities: the system will do its best never to reach this latency. You can set it 20-30 % higher than your usual average latency.

Implementation

Resilience4j will provide a new Bulkhead implementation called AdaptiveBulkhead. It will have the following config params:

  1. Desirable average throughput = X
  2. Desirable request latency = R
  3. Maximum acceptable request latency = Rmax (default R * 1.3)
  4. Window duration for adaptation = Wa (default 5 sec)
  5. Window duration for reconfiguration = Wr (default 900 sec)

From these params we will calculate:

  1. Initial average number of concurrent requests (later N).
    Example: N = X * R = 30 * 0.1 = 3 [op]
  2. Initial max latency of current window (later cRmax).
    Example: cRmax = min(R * 1.2, Rmax) = min(0.1 * 1.2, 0.13) = 0.12 [sec]
  3. Size of adaptation sliding window (later WaN)
    Example: WaN = Wa * X = 5 * 30 = 150
  4. Size of reconfiguration sliding window (later WrN)
    Example: WrN = Wr / Wa = 900 / 5 = 180
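
The four derived values above can be sketched in Java as follows. This is only an illustration of the arithmetic: the class and field names are hypothetical and not part of Resilience4j, and rounding N and keeping it at 1 or more is an assumption that the comment does not state.

```java
/** Hypothetical value holder for the derived params; not a Resilience4j class. */
final class AdaptiveBulkheadParams {
    final int initialConcurrency;        // N
    final double initialMaxLatencySec;   // cRmax
    final int adaptationWindowSize;      // WaN
    final int reconfigurationWindowSize; // WrN

    AdaptiveBulkheadParams(double x, double r, double rMax, double waSec, double wrSec) {
        // N = X * R (Little's law); rounding and the floor of 1 are assumptions.
        this.initialConcurrency = (int) Math.max(1L, Math.round(x * r));
        // cRmax = min(R * 1.2, Rmax)
        this.initialMaxLatencySec = Math.min(r * 1.2, rMax);
        // WaN = Wa * X: expected number of latency samples per adaptation window.
        this.adaptationWindowSize = (int) Math.round(waSec * x);
        // WrN = Wr / Wa: number of adaptation cycles per reconfiguration window.
        this.reconfigurationWindowSize = (int) Math.round(wrSec / waSec);
    }
}
```

With the example values from the list (X = 30, R = 0.1, Rmax = 0.13, Wa = 5, Wr = 900), `new AdaptiveBulkheadParams(30, 0.1, 0.13, 5, 900)` gives N = 3, cRmax of about 0.12 sec, WaN = 150 and WrN = 180.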

From now on we will have two separate functions working inside AdaptiveBulkhead:

First function is constantAdaptation

It will constantly measure request latencies and add them to the adaptation sliding window.
After each cycle of duration Wa, the constantAdaptation function will calculate the current mean latency of the adaptation window (cR) and add this cR to the reconfiguration sliding window. Then it will compare cR with cRmax.
If cR < cRmax, the constantAdaptation function will raise the number of concurrent requests N by one; otherwise it will multiply N by 0.75.
It will also use cR to calculate the max permission wait time as R - cR.
So for the next cycle the Bulkhead will run with the new params.
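
One adaptation cycle as described above could look roughly like this. It is a minimal sketch, not the planned implementation: the class and method names are hypothetical, and three details are assumptions the description leaves open (an empty window changes nothing, N never drops below 1, and the wait time is clamped at zero when cR exceeds R).

```java
/** Hypothetical sketch of one constantAdaptation cycle; not Resilience4j code. */
final class ConstantAdaptationSketch {
    private static final double DECREASE_FACTOR = 0.75;

    private final double targetLatencySec;     // R
    private final double currentMaxLatencySec; // cRmax (kept fixed in this sketch)
    private int concurrencyLimit;               // N
    private double maxWaitSec;                  // max permission wait time

    ConstantAdaptationSketch(int initialLimit, double targetLatencySec,
                             double currentMaxLatencySec) {
        this.concurrencyLimit = initialLimit;
        this.targetLatencySec = targetLatencySec;
        this.currentMaxLatencySec = currentMaxLatencySec;
    }

    /** Called once per Wa with the latencies collected in the adaptation window. */
    void onWindowClosed(double[] latenciesSec) {
        if (latenciesSec.length == 0) {
            return; // assumption: no samples means no adjustment
        }
        double sum = 0.0;
        for (double latency : latenciesSec) {
            sum += latency;
        }
        double cR = sum / latenciesSec.length; // current mean latency
        if (cR < currentMaxLatencySec) {
            concurrencyLimit += 1; // additive increase
        } else {
            // multiplicative decrease; the floor of 1 is an assumption
            concurrencyLimit = Math.max(1, (int) (concurrencyLimit * DECREASE_FACTOR));
        }
        // R - cR, clamped at zero (the clamp is an assumption)
        maxWaitSec = Math.max(0.0, targetLatencySec - cR);
    }

    int concurrencyLimit() { return concurrencyLimit; }

    double maxWaitSec() { return maxWaitSec; }
}
```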

Second function is configurationAdjustment

This function will be triggered once per Wr. It will calculate the standard deviation of the latencies in the reconfiguration sliding window and recalculate cRmax:
cRmax = min(cR + standard deviation of reconfiguration window, Rmax)

This configurationAdjustment will help us handle daily latency changes gracefully without reaching cRmax often.
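
A possible shape for the cRmax recalculation, as a sketch only. The comment does not say which cR enters the formula or which standard deviation is meant, so this version assumes the mean of the cR values stored in the reconfiguration window and the population standard deviation. The class and method names are hypothetical.

```java
/** Hypothetical sketch of the cRmax recalculation; not Resilience4j code. */
final class ConfigurationAdjustmentSketch {

    /**
     * @param windowMeansSec the cR values collected in the reconfiguration window
     * @param rMaxSec        the configured maximum acceptable latency (Rmax)
     * @return the new cRmax = min(mean + standard deviation, Rmax)
     */
    static double recalculateMaxLatency(double[] windowMeansSec, double rMaxSec) {
        if (windowMeansSec.length == 0) {
            throw new IllegalArgumentException("reconfiguration window is empty");
        }
        double sum = 0.0;
        for (double cR : windowMeansSec) {
            sum += cR;
        }
        // Assumption: the window mean stands in for "cR" in the formula.
        double mean = sum / windowMeansSec.length;

        double squaredDiffs = 0.0;
        for (double cR : windowMeansSec) {
            squaredDiffs += (cR - mean) * (cR - mean);
        }
        // Assumption: population standard deviation (divide by n, not n - 1).
        double stdDev = Math.sqrt(squaredDiffs / windowMeansSec.length);

        return Math.min(mean + stdDev, rMaxSec);
    }
}
```

Capping at Rmax means that a noisy window can never push the adaptive threshold past the configured hard limit.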

@storozhukBM
Member

@RobWin if you don't have any concerns I'll proceed with implementation.

@RobWin
Member Author

RobWin commented Feb 13, 2018

Can we implement this with a reactive streaming library like Reactor or RxJava?
Could we consume events from our existing RateLimiter or Bulkhead and reconfigure them at runtime?

@storozhukBM
Member

Yes, I think it is possible, but such an implementation will have some performance drawbacks and will create a runtime dependency on a streaming library.

@RobWin
Member Author

RobWin commented Feb 13, 2018

I'm looking forward to your implementation 👍

@storozhukBM
Member

storozhukBM commented Mar 27, 2018

Graph from the first integration test:
[image: measurement_01]

The main idea is working. Now I'll polish the implementation and make additional performance and integration tests.

@RobWin RobWin added this to the 0.14.0 milestone Jun 18, 2018
@RobWin
Member Author

RobWin commented Mar 21, 2019

Can we progress with this?

@RobWin RobWin modified the milestones: 0.14.0, 0.15.0 Mar 22, 2019
@RobWin
Member Author

RobWin commented Mar 22, 2019

I think the multiplicative-decrease factor and additive-increase value should also be configurable.

@RobWin
Member Author

RobWin commented Mar 22, 2019

AIMD requires a signal of congestion. Most frequently, response time outs or bad response times serve as an implicit signal.
Do you think it also makes sense to check the content of successful responses, in case the server explicitly signals that the client should back off?

@storozhukBM
Member

Yep, we can provide an external API to trigger back-off, so our users can plug in any other type of analysis, whether pluggable or written by hand.

@Romeh Romeh removed this from the 0.15.0 milestone Jun 3, 2019
@Romeh Romeh added this to Review in progress in Release v0.17.0 Jun 25, 2019
@Romeh Romeh self-assigned this Jun 25, 2019
@liangGTY

liangGTY commented Aug 6, 2019

Hi @RobWin, I would like to ask if there is any progress. When will it be released?

@RobWin
Member Author

RobWin commented Aug 6, 2019

Yes, there is progress, but I'm not sure when it will be ready for release.

@RobWin
Member Author

RobWin commented Oct 6, 2020

[image]

Our AdaptiveBulkhead implements the TCP Congestion Avoidance algorithm in a protocol-agnostic way.
Instead of a Congestion Window Size, we have a Concurrency Limit.

TCP uses a mechanism called slow start to increase the congestion window after a connection is initialized or after a timeout. It starts with a certain window size. Although the initial rate is low, the rate of increase is very rapid; for every packet acknowledged, the congestion window increases by 1 MSS so that the congestion window effectively doubles for every round-trip time (RTT).

Our AdaptiveBulkhead is a state machine with two states: SLOW_START and CONGESTION_AVOIDANCE.
The user can configure:

  • Initial Concurrency Limit
  • Slow Start Concurrency Threshold
  • Minimum Concurrency Limit
  • Maximum Concurrency Limit
  • Increase Summand (default: 1)
  • Increase Multiplier (default: 2)
  • Decrease Multiplier (default: 0.5)

In the SLOW_START state the AdaptiveBulkhead will increase the concurrency limit exponentially for every successful call (by using the "Increase Multiplier") until the "Slow Start Concurrency Threshold" is reached; then it enters the CONGESTION_AVOIDANCE state. In the CONGESTION_AVOIDANCE state the concurrency limit is increased linearly (by using the "Increase Summand").

The AdaptiveBulkhead uses the same Sliding Window implementation as the CircuitBreaker to track successful/failed calls. The Sliding Window is used to calculate a failure rate (and slow call rate) based on the last N calls or the calls of the last N seconds. The failure rate increases when certain exceptions occur, e.g. TimeoutException, IOException or TooManyCallsExceptions. If the failure rate (or slow call rate) threshold is exceeded, the AdaptiveBulkhead will apply a multiplicative decrease to the concurrency limit and reset the Sliding Window.

If the Minimum Concurrency Limit is reached, it will enter SLOW_START state again.
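
To make the state machine above concrete, here is a small self-contained sketch. It is not the AdaptiveBulkhead code from the branch: the class name, the `onResult(boolean)` method, the `windowSize` and `failureRateThreshold` parameters, and the simple count-based window (instead of the CircuitBreaker's Sliding Window) are all assumptions made for illustration. Slow calls are not modelled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of the two-state limit adjustment; not Resilience4j code. */
final class AdaptiveLimitSketch {
    enum State { SLOW_START, CONGESTION_AVOIDANCE }

    private final int slowStartThreshold;
    private final int minLimit;
    private final int maxLimit;
    private final int windowSize;              // assumption: count-based window
    private final double failureRateThreshold; // assumption: fraction in [0, 1]
    private final int increaseSummand = 1;           // default from the list above
    private final double increaseMultiplier = 2.0;   // default from the list above
    private final double decreaseMultiplier = 0.5;   // default from the list above

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private int failures;
    private int limit;
    private State state = State.SLOW_START;

    AdaptiveLimitSketch(int initialLimit, int slowStartThreshold, int minLimit,
                        int maxLimit, int windowSize, double failureRateThreshold) {
        this.limit = initialLimit;
        this.slowStartThreshold = slowStartThreshold;
        this.minLimit = minLimit;
        this.maxLimit = maxLimit;
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    /** Records one finished call and adjusts the concurrency limit. */
    void onResult(boolean failed) {
        record(failed);
        boolean windowFull = window.size() == windowSize;
        if (windowFull && (double) failures / windowSize > failureRateThreshold) {
            decrease();
        } else if (!failed) {
            increase();
        }
    }

    private void record(boolean failed) {
        // Evict the oldest outcome once the window is full.
        if (window.size() == windowSize && window.removeFirst()) {
            failures--;
        }
        window.addLast(failed);
        if (failed) {
            failures++;
        }
    }

    private void increase() {
        if (state == State.SLOW_START) {
            // Exponential growth until the slow-start threshold is reached.
            limit = (int) Math.min(maxLimit, limit * increaseMultiplier);
            if (limit >= slowStartThreshold) {
                state = State.CONGESTION_AVOIDANCE;
            }
        } else {
            // Linear growth in congestion avoidance.
            limit = Math.min(maxLimit, limit + increaseSummand);
        }
    }

    private void decrease() {
        limit = (int) Math.max(minLimit, limit * decreaseMultiplier);
        window.clear(); // reset the window after a multiplicative decrease
        failures = 0;
        if (limit <= minLimit) {
            state = State.SLOW_START;
        }
    }

    int limit() { return limit; }

    State state() { return state; }
}
```

Resetting the window after each decrease means the limit is cut at most once per full window of fresh observations, which keeps a burst of failures from collapsing the limit in a single step.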

Now we need to map this to the AdaptiveRateLimiter ;)
You are right, the AdaptiveRateLimiter should not multiplicative-decrease the rate limit on every failed call. But the RateLimiter is currently not tracking a failure rate. What is your suggestion without changing the existing RateLimiter too much?

@adwsingh
Contributor

adwsingh commented Oct 8, 2020

Hi, are we also planning to cover circuit breaker closing using an AIMD strategy here? The circuit breaker would gradually move from OPEN -> HALF_OPEN -> CLOSED based on AIMD.

@RobWin
Member Author

RobWin commented Oct 8, 2020

Not sure if an adaptive CircuitBreaker is really necessary if there is an AdaptiveBulkhead based on AIMD. I have the feeling that the changes required in the CircuitBreaker would be too many.

We could rather use support with implementing the AdaptiveBulkhead.
OPEN -> Concurrency Limit = 0
HALF_OPEN is SLOW_START phase
CLOSED is CONGESTION_AVOIDANCE phase

If this doesn't match your requirements, we should implement an AdaptiveCircuitBreaker, but not touch the existing one too much.

@adwsingh
Contributor

adwsingh commented Oct 8, 2020

Makes sense. For Adaptive Bulkhead, is the implementation done or can I help in building it?

hexmind added a commit that referenced this issue Apr 15, 2023
Issue #201 adaptive bulkhead config and time fixed
hexmind added a commit to hexmind/resilience4j that referenced this issue Apr 29, 2023
hexmind added a commit to hexmind/resilience4j that referenced this issue Apr 29, 2023
hexmind added a commit to hexmind/resilience4j that referenced this issue May 2, 2023
hexmind added a commit to hexmind/resilience4j that referenced this issue May 3, 2023
RobWin pushed a commit that referenced this issue May 8, 2023
* Issue #201 EventPublisher hierarchy fixed

* Issue #201 onResult added to AdaptiveBulkhead

* Issue #201 states extracted to own files
hexmind added a commit to hexmind/resilience4j that referenced this issue Jul 17, 2023
hexmind added a commit to hexmind/resilience4j that referenced this issue Jul 19, 2023
hexmind added a commit to hexmind/resilience4j that referenced this issue Jul 19, 2023
hexmind added a commit to hexmind/resilience4j that referenced this issue Jul 21, 2023
@scott-deboy

scott-deboy commented Nov 29, 2023

Hey folks - this could really be useful for a project I'm involved with at work.

Could someone share what work is outstanding in order to consider this complete and get it merged? I may be able to contribute to the effort, or get others involved.

hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
hexmind added a commit to hexmind/resilience4j that referenced this issue Jan 5, 2024
RobWin added a commit that referenced this issue Jan 5, 2024
* Issue #201 duplication removed from PredicateCreator clients (#1921)

* Adaptive bulkhead 2023.1.4 (#1873)

* Super early draft of adaptive bulkhead.
!!! NOT TESTED !!!

* Small proof of concept test.

* Code cleanup. Builder for BulkheadAdaptationConfig

* Small code clean up. Metrics implementation. Additional TODOs

* Additional TODOs

* Small renaming

* first round of adaptive bulkhead updates

* first round of adaptive bulkhead updates

* first round of adaptive bulkhead updates

* second round of adaptive bulkhead updates

* Sonar fixes

* adaptive bulkhead refactoring to match last discussion

* adaptive bulkhead refactoring to match last discussion

* adaptive bulkhead refactoring to match last discussion

* adaptive bulkhead refactoring to match last discussion

* adaptive bulkhead refactoring to match last discussion

* adaptive bulkhead refactoring to match last discussion

* Sonar fixes

* reduce code duplication

* add exception predicate unit testing

* first round of updates after introducing sliding window with AIMD limiter

* Sonar fixes

* Sonar fixes

* first round of updates for making adaptive policy independent from bulkhead

* first round of applying the review comments

* first round of applying the review comments

* second round of applying the review comments

* Third round of applying the review comments

* fix javadoc

* sonar fixes

* review comments round 4

* code cleanup

* review comments round 5

* review comments round 5

* review comments round 5

* #review comments

* Sonar fixes

* increase test coverage

* first round of review comments apply

* first round of review comments apply

* fix compile issue

* tune the adaptive bulkhead test

* New Draft

* Adaptive Bulkhead Draft

* Adaptive Bulkhead 2 Draft (#1306)

* Remove AIMD leftovers (#1397)

---------

Co-authored-by: bstorozhuk <storozhuk.b.m@gmail.com>
Co-authored-by: Mahmoud Romeh <mahmoud.romeh@collibra.com>
Co-authored-by: Robert Winkler <rwinkler@telekom.de>
Co-authored-by: KrnSaurabh <39181662+KrnSaurabh@users.noreply.github.com>
Co-authored-by: Robert Winkler <robwin@t-online.de>
Co-authored-by: Dan Maas <daniel.maas@target.com>

* Issue #201 bulkhead events corrected

* Issue #201 BulkheadOnLimitIncreasedEvent, BulkheadOnLimitDecreasedEvent merged into one event

* Issue #201 adaptive bulkhead config and time fixed

* Adaptive bulkhead 4.1 (#1947)

* Issue #201 EventPublisher hierarchy fixed

* Issue #201 onResult added to AdaptiveBulkhead

* Issue #201 states extracted to own files

* Issue #201 AdaptiveBulkhead states simplified (#1995)

* Issue #201 IncreaseInterval for SlowStartState added (#2001)

* Issue #201 timeUnit replaced by currentTimestamp (#2007)

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Mike Dias <mike.rodrigues.dias@gmail.com>
Co-authored-by: Mike Dias <mdias@atlassian.com>
Co-authored-by: Hongbin Kim <fusis1@naver.com>
Co-authored-by: hongbin-deali <hongbin.kim@deali.net>
Co-authored-by: Andreas Mautsch <amautsch@gmx.de>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tobi5775 <50146675+tobi5775@users.noreply.github.com>
Co-authored-by: TWI <audi-connect@msg.group>
Co-authored-by: Karol Nowak <nowkarol+github@gmail.com>
Co-authored-by: Łukasz Nowak <38426907+nluk@users.noreply.github.com>
Co-authored-by: Łukasz Nowak <lukasz.nowak@idemia.com>
Co-authored-by: Muhammad Sohail <mhsohail56@gmail.com>
Co-authored-by: Muhammad Sohail <muhammad.sohail@autodesk.com>
Co-authored-by: SOOHYUN-LIM <sh.lim7682@gmail.com>
Co-authored-by: 임수현 <soohyun.lim2@cj.net>
Co-authored-by: Robert Winkler <rwinkler@telekom.de>
Co-authored-by: David Hrbacek <35535320+b923@users.noreply.github.com>
Co-authored-by: Hrbacek, David <david.hrbacek@firma.seznam.cz>
Co-authored-by: Oleksandr L <laviua@users.noreply.github.com>
Co-authored-by: Kerwin Bryant <kerwin612@qq.com>
Co-authored-by: Mariusz Kopylec <mariusz.kopylec@o2.pl>
Co-authored-by: Mariusz Kopylec <mariusz.kopylec@allegro.pl>
Co-authored-by: Deepak Kumar <deep.rnj@gmail.com>
Co-authored-by: tanuja5 <1997.tanuja@gmail.com>
Co-authored-by: Graeme Rocher <graeme.rocher@gmail.com>
Co-authored-by: seokgoon28 <148044191+seokgoon28@users.noreply.github.com>
Co-authored-by: jattisha <jovanattisha@yahoo.com>
Co-authored-by: Hartigan <hartigans@live.com>
Co-authored-by: shijun_deng <dengshijun1992@qq.com>
Co-authored-by: mbio <qnwlqnwlxm@naver.com>
Co-authored-by: bstorozhuk <storozhuk.b.m@gmail.com>
Co-authored-by: Mahmoud Romeh <mahmoud.romeh@collibra.com>
Co-authored-by: KrnSaurabh <39181662+KrnSaurabh@users.noreply.github.com>
Co-authored-by: Robert Winkler <robwin@t-online.de>
Co-authored-by: Dan Maas <daniel.maas@target.com>
@rchinmay

Hello guys, this seems to have been in progress for a very long time. I can join in and help if necessary. Can someone please point me to the current branch where progress is being made, since there seem to be a lot of branches related to this feature?
