Skip to content

[Issue 8138][pulsar-client] Fix ConsumerImpl memory leaks #8160

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 30, 2020

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Sep 29, 2020

Fixes #8138

Motivation

The issue #8138 is about multiple memory leaks happening when using the Pulsar Java Client Reader API.
One of the leaks was previously fixed by #8149 .

This PR fixes two remaining ConsumerImpl memory leak issues:

  • leaks of ConsumerImpl instances via active timers (unAckedMessageTracker, acknowledgmentsGroupingTracker, stats)
  • leaks of ConsumerImpl instances via ClientCnx when the consumer gets closed by the Broker and the consumer picks a new connection to use

Modifications

  • modifications to fix the leaks of ConsumerImpl via active timers:

    • calls to "closeConsumerTasks" were added in the locations where they were missing
  • modifications to fix the leaks of ConsumerImpl via ClientCnx

    • a solution was added to track to which ClientCnx the ConsumerImpl has been registered. When the ConsumerImpl gets registered to a new ClientCnx, the same instance is unregistered from the previous ClientCnx it was previously registered to.

    • when the connection gets switched, the ConsumerImpl must be
      de-registered from the previous ClientCnx where it got registered
      previously

      • the connection gets switched with the current seek implementation
        that disconnects the consumer after a seek.
        • the new connection is picked randomly of the existing connections
          in the connection pool. Therefore reproducing the leak required
          setting connectionsPerBroker > 1.
    • attempt to cover also other cases where the ConsumerImpl instance
      wasn't de-registered properly.

      • whenever client.cleanupConsumer(this) is called, the
        de-registration from ClientCnx (where the ConsumerImpl was previously registered) should be made
      • fix a bug in closeAsync where cleanup was skipped when the channel
        was closed or an exception occurred in sending the close consumer
        request

Verifying this change

  • this PR doesn't include new tests. There is a manual test in lhotari@aff700c that has been used to reproduce the memory leak and to verify that no ConsumerImpl references are leaked after the fix has been applied.

- call closeConsumerTasks whenever consumer is discarded
@lhotari
Copy link
Member Author

lhotari commented Sep 29, 2020

/pulsarbot run-failure-checks

@lhotari lhotari closed this Sep 29, 2020
@lhotari lhotari reopened this Sep 29, 2020
@sijie sijie added this to the 2.7.0 milestone Sep 29, 2020
@lhotari lhotari force-pushed the lh-fix-consumer-impl-memory-leak branch from 9532630 to 175f03e Compare September 29, 2020 20:26
- when the connection gets switched, the ConsumerImpl must be
  de-registered from the previous ClientCnx where it got registered
  previously
  - the connection gets switched with the current seek implementation
    that disconnects the consumer after a seek.
    - the new connection is picked randomly of the existing connections
      in the connection pool. Therefore reproducing the leak required
      setting connectionsPerBroker > 1.

- attempt to cover also other cases where the ConsumerImpl instance
  wasn't de-registered properly.
  - whenever client.cleanupConsumer(this) is called, the
    de-registration should be made
  - fix a bug in closeAsync where cleanup was skipped when the channel
    was closed or an exception occurred in sending the close consumer
    request
@lhotari lhotari force-pushed the lh-fix-consumer-impl-memory-leak branch from 175f03e to 0a706fb Compare September 29, 2020 21:02
@lhotari
Copy link
Member Author

lhotari commented Sep 29, 2020

I have revisited the original PR I made earlier today. The changes are now ready for review.

It seems that the current model used to handle state and state transitions is error prone. The code gets duplicated in several locations and it's hard to be sure if it's fine to unify the behavior in all cases.
Perhaps a model where the state transition triggers "side-effects" would be a better way to implement consistent behavior? For example, when the state changes to "Closed", certain "side-effects" would be triggered.

I don't have test coverage for the changes I made. It seems that it would be possible to model consumer's state machine and test that it behaves in the correct way. However the barrier for such improvements is fairly large. :)

@lhotari
Copy link
Member Author

lhotari commented Sep 30, 2020

/pulsarbot run-failure-checks

@lhotari
Copy link
Member Author

lhotari commented Sep 30, 2020

The CI checks have passed now. @jiazhai Would you be able to review and provide feedback? Thank you

@jiazhai jiazhai merged commit 170f12c into apache:master Sep 30, 2020
@lhotari lhotari deleted the lh-fix-consumer-impl-memory-leak branch September 30, 2020 15:12
lbenc135 pushed a commit to lbenc135/pulsar that referenced this pull request Oct 3, 2020
Fixes apache#8138

### Motivation

The issue apache#8138 is about multiple memory leaks happening when using the Pulsar Java Client Reader API.  
One of the leaks was previously fixed by apache#8149 . 

This PR fixes two remaining ConsumerImpl memory leak issues:
- leaks of ConsumerImpl instances via active timers (unAckedMessageTracker, acknowledgmentsGroupingTracker, stats)
- leaks of ConsumerImpl instances via ClientCnx when the consumer gets closed by the Broker and the consumer picks a new connection to use

### Modifications

- modifications to fix the leaks of ConsumerImpl via active timers:
  - calls to "closeConsumerTasks" were added in the locations where they were missing

- modifications to fix the leaks of ConsumerImpl via ClientCnx
  - a solution was added to track to which ClientCnx the ConsumerImpl has been registered. When the ConsumerImpl gets registered to a new ClientCnx, the same instance is unregistered from the previous ClientCnx it was previously registered to.

  - when the connection gets switched, the ConsumerImpl must be
    de-registered from the previous ClientCnx where it got registered
    previously
    - the connection gets switched with the current seek implementation
      that disconnects the consumer after a seek.
      - the new connection is picked randomly of the existing connections
        in the connection pool. Therefore reproducing the leak required
        setting connectionsPerBroker > 1.

  - attempt to cover also other cases where the ConsumerImpl instance
    wasn't de-registered properly.
    - whenever client.cleanupConsumer(this) is called, the
      de-registration from ClientCnx (where the ConsumerImpl was previously registered) should be made
    - fix a bug in closeAsync where cleanup was skipped when the channel
      was closed or an exception occurred in sending the close consumer
      request


### Verifying this change

- this PR doesn't include new tests. There is a manual test in lhotari@aff700c that has been used to reproduce the memory leak and to verify that no ConsumerImpl references are leaked after the fix has been applied.
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Nov 13, 2020
Fixes apache#8138

### Motivation

The issue apache#8138 is about multiple memory leaks happening when using the Pulsar Java Client Reader API.  
One of the leaks was previously fixed by apache#8149 . 

This PR fixes two remaining ConsumerImpl memory leak issues:
- leaks of ConsumerImpl instances via active timers (unAckedMessageTracker, acknowledgmentsGroupingTracker, stats)
- leaks of ConsumerImpl instances via ClientCnx when the consumer gets closed by the Broker and the consumer picks a new connection to use

### Modifications

- modifications to fix the leaks of ConsumerImpl via active timers:
  - calls to "closeConsumerTasks" were added in the locations where they were missing

- modifications to fix the leaks of ConsumerImpl via ClientCnx
  - a solution was added to track to which ClientCnx the ConsumerImpl has been registered. When the ConsumerImpl gets registered to a new ClientCnx, the same instance is unregistered from the previous ClientCnx it was previously registered to.

  - when the connection gets switched, the ConsumerImpl must be
    de-registered from the previous ClientCnx where it got registered
    previously
    - the connection gets switched with the current seek implementation
      that disconnects the consumer after a seek.
      - the new connection is picked randomly of the existing connections
        in the connection pool. Therefore reproducing the leak required
        setting connectionsPerBroker > 1.

  - attempt to cover also other cases where the ConsumerImpl instance
    wasn't de-registered properly.
    - whenever client.cleanupConsumer(this) is called, the
      de-registration from ClientCnx (where the ConsumerImpl was previously registered) should be made
    - fix a bug in closeAsync where cleanup was skipped when the channel
      was closed or an exception occurred in sending the close consumer
      request


### Verifying this change

- this PR doesn't include new tests. There is a manual test in lhotari@aff700c that has been used to reproduce the memory leak and to verify that no ConsumerImpl references are leaked after the fix has been applied.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants