Skip to content

Fix producer stucks on creating ledger timeout #7319

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 22, 2020

Conversation

sijie
Copy link
Member

@sijie sijie commented Jun 20, 2020

Motivation

The ledgerCreated flag is passed as ctx to the createLedger callback.
The callback already had the logic on handling ledgerCreated flag. But we set the flag to false
when timeout happens. It will trigger the following race condition:

a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out.
b) If the createComplete callback eventually completes, it will see ledgerCreated flag is set to true,
so it will cause checkAndCompleteLedgerOpTask returns false and exist too early without processing the
pending add requests.

This race condition only happens when creating ledger times out.

    public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) {
        if (log.isDebugEnabled()) {
            log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1);
        }

        if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) {
            return;
        }

Modification

The timeout logic shouldn't modify the ledgerCreated context. It should let the callback itself to process
the ledgerCreated context.

Unverified

This user has not yet uploaded their public signing key.
*Motivation*

The `ledgerCreated` flag is passed as ctx to the createLedger callback.
The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false`
when timeout happens. It will trigger the following race condition:

a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out.
b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true,
so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the
pending add requests.

This race condition only happens when creating ledger times out.

```
    public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) {
        if (log.isDebugEnabled()) {
            log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1);
        }

        if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) {
            return;
        }
```

*Modification*

The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process
the `ledgerCreated` context.
@@ -3177,8 +3177,7 @@ protected void asyncCreateLedger(BookKeeper bookKeeper, ManagedLedgerConfig conf
digestType, config.getPassword(), cb, ledgerCreated, finalMetadata);
scheduledExecutor.schedule(() -> {
if (!ledgerCreated.get()) {
ledgerCreated.set(true);
cb.createComplete(BKException.Code.TimeoutException, null, null);
cb.createComplete(BKException.Code.TimeoutException, null, ledgerCreated);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't there be a race condition here (also in existing code) if the creation completes at the same time that the timeout is triggered?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callback is synchronized at ManagedLedgerImpl. So there is no race condition here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not always the case. eg: In createNewMetadataLedger() when we create a new cursor ledger, we pass a lambda and there's no mutex. The logic in checkAndCompleteLedgerOpTask() seems race prone.

I think we should probably use compareAndSet() to manipulate that ledgerCreated flag, which also might be better represented by an enum with multiple states.

Finally (and this is unrelated to the bug here), we should try to cancel the timeout task, once the create operation succeeds or fails before it times out. Otherwise we can get into a state with many timer tasks pending.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I missed the createNewMetadataLedger operation. I changed it to use CAS.

For canceling the timeout tasks, I will have a follow-up issue to do that.

@jiazhai
Copy link
Member

jiazhai commented Jun 20, 2020

/pulsarbot run-failure-checks

Unverified

This user has not yet uploaded their public signing key.
@merlimat merlimat merged commit a34f693 into apache:master Jun 22, 2020
jiazhai pushed a commit that referenced this pull request Jun 23, 2020
* Fix producer stucks on creating ledger timeout

*Motivation*

The `ledgerCreated` flag is passed as ctx to the createLedger callback.
The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false`
when timeout happens. It will trigger the following race condition:

a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out.
b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true,
so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the
pending add requests.

This race condition only happens when creating ledger times out.

```
    public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) {
        if (log.isDebugEnabled()) {
            log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1);
        }

        if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) {
            return;
        }
```

*Modification*

The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process
the `ledgerCreated` context.

* Change to use CAS
(cherry picked from commit a34f693)
jiazhai pushed a commit that referenced this pull request Jun 23, 2020
* Fix producer stucks on creating ledger timeout

*Motivation*

The `ledgerCreated` flag is passed as ctx to the createLedger callback.
The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false`
when timeout happens. It will trigger the following race condition:

a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out.
b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true,
so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the
pending add requests.

This race condition only happens when creating ledger times out.

```
    public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) {
        if (log.isDebugEnabled()) {
            log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1);
        }

        if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) {
            return;
        }
```

*Modification*

The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process
the `ledgerCreated` context.

* Change to use CAS
(cherry picked from commit a34f693)
codelipenghui pushed a commit to streamnative/pulsar-archived that referenced this pull request Jul 14, 2020
* Fix producer stucks on creating ledger timeout

*Motivation*

The `ledgerCreated` flag is passed as ctx to the createLedger callback.
The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false`
when timeout happens. It will trigger the following race condition:

a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out.
b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true,
so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the
pending add requests.

This race condition only happens when creating ledger times out.

```
    public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) {
        if (log.isDebugEnabled()) {
            log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1);
        }

        if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) {
            return;
        }
```

*Modification*

The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process
the `ledgerCreated` context.

* Change to use CAS

(cherry picked from commit a34f693)
cdbartholomew pushed a commit to kafkaesque-io/pulsar that referenced this pull request Jul 24, 2020
* Fix producer stucks on creating ledger timeout

*Motivation*

The `ledgerCreated` flag is passed as ctx to the createLedger callback.
The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false`
when timeout happens. It will trigger the following race condition:

a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out.
b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true,
so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the
pending add requests.

This race condition only happens when creating ledger times out.

```
    public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) {
        if (log.isDebugEnabled()) {
            log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1);
        }

        if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) {
            return;
        }
```

*Modification*

The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process
the `ledgerCreated` context.

* Change to use CAS
(cherry picked from commit a34f693)
@wolfstudy wolfstudy removed this from the 2.7.0 milestone Jul 29, 2020
@wolfstudy wolfstudy added this to the 2.7.0 milestone Aug 12, 2020
@lhotari
Copy link
Member

lhotari commented Aug 20, 2020

Hi,
We are facing this issue in chaos monkey type of testing of Pulsar, with Pulsar version 2.5.2 .
It's great that this issue is already fixed. Will this fix be available in Pulsar 2.6.1 ? How about Pulsar 2.5.3 ?
Thank you!

huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
* Fix producer stucks on creating ledger timeout

*Motivation*

The `ledgerCreated` flag is passed as ctx to the createLedger callback.
The callback already had the logic on handling `ledgerCreated` flag. But we set the flag to `false`
when timeout happens. It will trigger the following race condition:

a) The createComplete callback is triggered when timeout. But the pending add requests are not error'd out.
b) If the createComplete callback eventually completes, it will see `ledgerCreated` flag is set to true,
so it will cause `checkAndCompleteLedgerOpTask` returns false and exist too early without processing the
pending add requests.

This race condition only happens when creating ledger times out.

```
    public synchronized void createComplete(int rc, final LedgerHandle lh, Object ctx) {
        if (log.isDebugEnabled()) {
            log.debug("[{}] createComplete rc={} ledger={}", name, rc, lh != null ? lh.getId() : -1);
        }

        if (checkAndCompleteLedgerOpTask(rc, lh, ctx)) {
            return;
        }
```

*Modification*

The timeout logic shouldn't modify the `ledgerCreated` context. It should let the callback itself to process
the `ledgerCreated` context.

* Change to use CAS
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants