-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Fix producer stucks on creating ledger timeout #7319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
*Motivation* The `ledgerCreated` flag is passed as ctx to the createLedger callback. The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false` when timeout happens. It will trigger the following race condition: a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out. b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true, so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the pending add requests. This race condition only happens when creating ledger times out. ``` public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) { if (log.isDebugEnabled()) { log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1); } if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) { return; } ``` *Modification* The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process the `ledgerCreated` context.
@@ -3177,8 +3177,7 @@ protected void asyncCreateLedger(BookKeeper bookKeeper, ManagedLedgerConfig conf | |||
digestType, config.getPassword(), cb, ledgerCreated, finalMetadata); | |||
scheduledExecutor.schedule(() -> { | |||
if (!ledgerCreated.get()) { | |||
ledgerCreated.set(true); | |||
cb.createComplete(BKException.Code.TimeoutException, null, null); | |||
cb.createComplete(BKException.Code.TimeoutException, null, ledgerCreated); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't there be a race condition here (also in existing code) if the creation completes at the same time that the timeout is triggered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The callback is synchronized at ManagedLedgerImpl. So there is no race condition here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not always the case. eg: In createNewMetadataLedger()
when we create a new cursor ledger, we pass a lambda and there's no mutex. The logic in checkAndCompleteLedgerOpTask()
seems race prone.
I think we should probably use compareAndSet() to manipulate that ledgerCreated
flag, which also might be better represented by an enum with multiple states.
Finally (and this is unrelated to the bug here), we should try to cancel the timeout task, once the create operation succeeds or fails before it times out. Otherwise we can get into a state with many timer tasks pending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I missed the createNewMetadataLedger operation. I changed it to use CAS.
For canceling the timeout tasks, I will have a follow-up issue to do that.
/pulsarbot run-failure-checks |
* Fix producer stucks on creating ledger timeout *Motivation* The `ledgerCreated` flag is passed as ctx to the createLedger callback. The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false` when timeout happens. It will trigger the following race condition: a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out. b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true, so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the pending add requests. This race condition only happens when creating ledger times out. ``` public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) { if (log.isDebugEnabled()) { log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1); } if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) { return; } ``` *Modification* The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process the `ledgerCreated` context. * Change to use CAS (cherry picked from commit a34f693)
* Fix producer stucks on creating ledger timeout *Motivation* The `ledgerCreated` flag is passed as ctx to the createLedger callback. The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false` when timeout happens. It will trigger the following race condition: a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out. b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true, so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the pending add requests. This race condition only happens when creating ledger times out. ``` public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) { if (log.isDebugEnabled()) { log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1); } if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) { return; } ``` *Modification* The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process the `ledgerCreated` context. * Change to use CAS (cherry picked from commit a34f693)
* Fix producer stucks on creating ledger timeout *Motivation* The `ledgerCreated` flag is passed as ctx to the createLedger callback. The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false` when timeout happens. It will trigger the following race condition: a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out. b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true, so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the pending add requests. This race condition only happens when creating ledger times out. ``` public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) { if (log.isDebugEnabled()) { log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1); } if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) { return; } ``` *Modification* The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process the `ledgerCreated` context. * Change to use CAS (cherry picked from commit a34f693)
* Fix producer stucks on creating ledger timeout *Motivation* The `ledgerCreated` flag is passed as ctx to the createLedger callback. The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false` when timeout happens. It will trigger the following race condition: a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out. b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true, so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the pending add requests. This race condition only happens when creating ledger times out. ``` public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) { if (log.isDebugEnabled()) { log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1); } if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) { return; } ``` *Modification* The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process the `ledgerCreated` context. * Change to use CAS (cherry picked from commit a34f693)
Hi, |
* Fix producer stucks on creating ledger timeout *Motivation* The `ledgerCreated` flag is passed as ctx to the createLedger callback. The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false` when timeout happens. It will trigger the following race condition: a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out. b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true, so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the pending add requests. This race condition only happens when creating ledger times out. ``` public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) { if (log.isDebugEnabled()) { log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1); } if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) { return; } ``` *Modification* The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process the `ledgerCreated` context. * Change to use CAS
Motivation
The
ledgerCreated
flag is passed as ctx to the createLedger callback.The callback already had the logic on handling
ledgerCreated
flag. But we set the flag tofalse
when timeout happens. It will trigger the following race condition:
a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out.
b) If the createComplete callback eventually completes, it will see
ledgerCreated
flag is set to true,so it will cause
checkAndCompleteLedgerOpTask
returns false and exist too early without processing thepending add requests.
This race condition only happens when creating ledger times out.
Modification
The timeout logic shouldn't modify the
ledgerCreated
context. It should let the callback itself to processthe
ledgerCreated
context.