pytorch-engine:0.18.0 causes memory leak when using NDManager.newBaseManager() #1886

Closed
925781609 opened this issue Aug 10, 2022 · 1 comment · Fixed by #1888
Labels
bug Something isn't working

Comments

@925781609
Contributor

925781609 commented Aug 10, 2022

Description

  1. With pytorch-engine:0.18.0, NDManager.newBaseManager() creates a PtNDManager by calling ai.djl.pytorch.engine.PtNDManager#newSubManager, which executes:
     PtNDManager manager = new PtNDManager(this, device);
     attachUncappedInternal(manager.uid, manager);
     return manager;
  2. attachUncappedInternal is implemented by BaseNDManager and adds the created PtNDManager to the parent's resources field:
     resources.put(resourceId, resource);
  3. The created PtNDManager is never released, even after it is closed:
     public void close() {
         if (!closed.getAndSet(true)) {
             // some code omitted
             parent.detachInternal(uid);
             resources.clear();
             tempResources.clear();
         }
     }

The parent is PtNDManager$SystemManager, and the parent's detachInternal does nothing:

@Override
public void detachInternal(String resourceId) {}

As a result, the closed PtNDManager stays referenced from the parent's resources field and is never collected by the JVM GC.
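The chain described above can be modeled without DJL. The following is a minimal sketch, not the DJL source: `LeakSketch`, `Manager`, and the string uids are hypothetical stand-ins for BaseNDManager and PtNDManager$SystemManager, kept only as detailed as needed to show why the parent's map keeps growing.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Simplified model of the reported leak; these are NOT the DJL classes. */
public class LeakSketch {

    /** Stand-in for BaseNDManager: tracks attached resources by id. */
    static class Manager implements AutoCloseable {
        final String uid;
        final Manager parent;
        final Map<String, AutoCloseable> resources = new ConcurrentHashMap<>();
        final AtomicBoolean closed = new AtomicBoolean(false);

        Manager(Manager parent, String uid) {
            this.parent = parent;
            this.uid = uid;
        }

        Manager newSubManager(String childUid) {
            Manager child = new Manager(this, childUid);
            // Mirrors the 0.18.0 call: the child is always put into the parent's map.
            attachUncappedInternal(childUid, child);
            return child;
        }

        void attachUncappedInternal(String resourceId, AutoCloseable resource) {
            resources.put(resourceId, resource);
        }

        void detachInternal(String resourceId) {
            resources.remove(resourceId);
        }

        @Override
        public void close() {
            if (!closed.getAndSet(true)) {
                if (parent != null) {
                    parent.detachInternal(uid);
                }
                resources.clear();
            }
        }
    }

    /** Stand-in for the system manager: detachInternal is a no-op, attach is not. */
    static class SystemManager extends Manager {
        SystemManager() {
            super(null, "system");
        }

        @Override
        void detachInternal(String resourceId) {
            // Intentionally empty, as in the snippet quoted in the issue.
        }
    }

    public static void main(String[] args) {
        SystemManager system = new SystemManager();
        for (int i = 0; i < 1000; i++) {
            try (Manager manager = system.newSubManager("m" + i)) {
                // do something here
            }
        }
        // Every closed child is still referenced by the system manager.
        System.out.println(system.resources.size()); // prints 1000
    }
}
```

In this model, the asymmetry between a working attach and a no-op detach is the whole problem: each closed child stays strongly reachable from a long-lived root.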

  4. Downgrading pytorch-engine to version 0.17.0 resolves the problem, because there newSubManager calls PtNDManager$SystemManager#attachInternal, which does nothing:
     PtNDManager manager = new PtNDManager(this, device);
     attachInternal(manager.uid, manager);
     return manager;

     @Override
     public void attachInternal(String resourceId, AutoCloseable resource) {}

Expected Behavior

The SystemManager should either not attach the created PtNDManager to its resources field, or release the PtNDManager when it is closed.
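Both acceptable behaviors can be sketched in a few lines. This is an illustration under assumptions, not the actual fix: `ExpectedBehaviorSketch`, `Root`, `Child`, and the `trackChildren` flag are hypothetical names. With the flag off, the root never attaches children; with it on, the root attaches them and removes them again on close.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of the two expected behaviors; NOT the DJL classes. */
public class ExpectedBehaviorSketch {

    /** Stand-in for a sub-manager that notifies its parent when closed. */
    static class Child implements AutoCloseable {
        final String uid;
        final Root parent;

        Child(Root parent, String uid) {
            this.parent = parent;
            this.uid = uid;
        }

        @Override
        public void close() {
            parent.detach(uid);
        }
    }

    /** Stand-in for the system manager. */
    static class Root {
        final Map<String, AutoCloseable> resources = new ConcurrentHashMap<>();
        final boolean trackChildren;

        Root(boolean trackChildren) {
            this.trackChildren = trackChildren;
        }

        Child newSubManager(String uid) {
            Child child = new Child(this, uid);
            if (trackChildren) {
                // Option 2: attach, relying on detach() to release on close.
                resources.put(uid, child);
            }
            // Option 1 (trackChildren == false): never attach at all.
            return child;
        }

        void detach(String uid) {
            // A real removal, so attach and detach stay symmetric.
            resources.remove(uid);
        }
    }

    /** Opens and closes sub-managers, then reports how many the root still holds. */
    static int leftoverAfter(Root root, int iterations) {
        for (int i = 0; i < iterations; i++) {
            try (Child child = root.newSubManager("m" + i)) {
                // do something here
            }
        }
        return root.resources.size();
    }

    public static void main(String[] args) {
        System.out.println(leftoverAfter(new Root(false), 1000)); // prints 0
        System.out.println(leftoverAfter(new Root(true), 1000));  // prints 0
    }
}
```

Either option keeps attach and detach consistent with each other, which is the property that 0.17.0 had (both no-ops) and 0.18.0 lost.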

Error Message

image

How to Reproduce?

  1. Use pytorch-engine version 0.18.0.
  2. Execute the code below repeatedly; it eventually causes an OOM.
     try (NDManager manager = NDManager.newBaseManager(Device.cpu())) {
         // do something here
     }
  3. Maven dependencies:
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-engine</artifactId>
            <version>0.18.0</version>
            <exclusions>
                <exclusion>
                    <artifactId>jna</artifactId>
                    <groupId>net.java.dev.jna</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
            <version>5.9.0</version>
        </dependency>

        <!--For Pre-CXX11 build -->
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cpu-precxx11</artifactId>
            <classifier>linux-x86_64</classifier>
            <version>1.11.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>1.11.0-0.18.0</version>
            <scope>runtime</scope>
        </dependency>
        <!-- windows -->
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-native-cpu</artifactId>
            <classifier>win-x86_64</classifier>
            <scope>runtime</scope>
            <version>1.11.0</version>
        </dependency>
        <dependency>
            <groupId>ai.djl.pytorch</groupId>
            <artifactId>pytorch-jni</artifactId>
            <version>1.11.0-0.18.0</version>
            <scope>runtime</scope>
        </dependency>
@925781609 925781609 added the bug Something isn't working label Aug 10, 2022
925781609 pushed a commit to 925781609/djl that referenced this issue Aug 10, 2022
zachgk added a commit to zachgk/djl that referenced this issue Aug 10, 2022
fixes deepjavalibrary#1886

This creates an interface for SystemNDManagers. That way, the behavior of
skipping over various functions can be moved to the BaseNDManager instead of
each individual SystemManager
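One way to read the commit description above is as a marker interface that the base manager checks centrally. The sketch below illustrates that idea only; it is not the code from the referenced commit, and `SystemManagerMarker`, `BaseManager`, and `PtSystemManager` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the marker-interface approach; NOT the DJL source. */
public class SystemManagerInterfaceSketch {

    /** Marker for root managers that must not track their children. */
    interface SystemManagerMarker {}

    /** Stand-in for BaseNDManager: the skip logic lives here, in one place. */
    static class BaseManager implements AutoCloseable {
        final String uid;
        final BaseManager parent;
        final Map<String, AutoCloseable> resources = new ConcurrentHashMap<>();

        BaseManager(BaseManager parent, String uid) {
            this.parent = parent;
            this.uid = uid;
        }

        BaseManager newSubManager(String childUid) {
            BaseManager child = new BaseManager(this, childUid);
            attach(childUid, child);
            return child;
        }

        void attach(String resourceId, AutoCloseable resource) {
            if (this instanceof SystemManagerMarker) {
                return; // system managers skip attachment
            }
            resources.put(resourceId, resource);
        }

        void detach(String resourceId) {
            if (this instanceof SystemManagerMarker) {
                return; // and skip detachment, symmetrically
            }
            resources.remove(resourceId);
        }

        @Override
        public void close() {
            if (parent != null) {
                parent.detach(uid);
            }
            resources.clear();
        }
    }

    /** An engine's system manager needs no overrides, only the marker. */
    static class PtSystemManager extends BaseManager implements SystemManagerMarker {
        PtSystemManager() {
            super(null, "system");
        }
    }

    public static void main(String[] args) {
        PtSystemManager system = new PtSystemManager();
        for (int i = 0; i < 1000; i++) {
            try (BaseManager manager = system.newSubManager("m" + i)) {
                // do something here
            }
        }
        System.out.println(system.resources.size()); // prints 0
    }
}
```

Because the check sits in the base class, an engine-specific system manager cannot forget to override one half of the attach/detach pair, which is the failure mode this issue describes.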
@lanking520
Contributor

Thanks for contributing a fix; we will track it.

frankfliu added a commit that referenced this issue Aug 10, 2022

* [fix]: fix memory leak when using NDManager.newBaseManager() create NDManager (#1886)

* Fixes compile issue for other engines

Change-Id: Id034fb17ecf918be381c725deeb557e528d3ef65

Co-authored-by: Liu,Yang <yliu37@trip.com>
Co-authored-by: Frank Liu <frankfliu2000@gmail.com>
patins1 pushed a commit to patins1/djl that referenced this issue Aug 26, 2022
…ary#1887)

zachgk added a commit to zachgk/djl that referenced this issue Sep 9, 2022
zachgk added a commit that referenced this issue Sep 12, 2022

* Creates a SystemNDManager interface

fixes #1886

This creates an interface for SystemNDManagers. That way, the behavior of
skipping over various functions can be moved to the BaseNDManager instead of
each individual SystemManager

* Updated LightGBM