java.lang.OutOfMemoryError: Direct buffer memory (with tcnative) #6813
Comments
maybe related to #6789 |
@floragunncom does your Netty app create and destroy a lot of SslContext instances? The leak in #6789 is a bit of native memory that gets allocated when an SslContext is created and then not freed when the SslContext is garbage-collected by the JVM. |
No, it's not creating a lot of SslContext instances, so this seems unrelated to #6789 |
|
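For reference, the pattern the question above probes for — building one TLS context up front and deriving cheap per-connection objects from it — can be sketched with JDK classes (the class name `SharedTlsContext` is mine; a Netty app would hold a single `io.netty.handler.ssl.SslContext` the same way instead of creating one per connection):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public final class SharedTlsContext {
    // One context for the whole application. Creating a fresh context
    // per connection is the pattern that leaked native memory in #6789.
    private static final SSLContext CONTEXT;

    static {
        try {
            CONTEXT = SSLContext.getDefault();
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Per-connection objects (SSLEngine here, Netty's SslHandler in a
    // Netty pipeline) are created from the shared context cheaply.
    public static SSLEngine newEngine() {
        return CONTEXT.createSSLEngine();
    }

    public static void main(String[] args) {
        System.out.println(newEngine() != null); // prints "true"
    }
}
```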
Also, did you "stop" writing when the Channel became non-writable? |
The error is still there with 4.1.12 and tcnative 2.0.3 - will start investigating this a bit more deeply now. It's definitely related to the amount of data transferred: for small datasets it works well, but as the amount of data increases it fails. |
It seems the error disappears after removing -XX:+DisableExplicitGC from the command line flags. But unfortunately this is not really an option for us because in production we have no control over the JVM flags, so we have to deal with -XX:+DisableExplicitGC being set. |
It looks like the problem was introduced in netty 4.1.8 or 4.1.9, because 4.1.7 was reported stable. Unfortunately I am not able to create a minimal reproducer, but I will assemble something that demonstrates the problem. Running netty without openssl (using Java SSL) works well for all versions and circumstances. |
Pls download https://bintray.com/floragunncom/files/download_file?file_path=netty%2Fnetty-6813-1.tar.gz and extract it. If you are on OSX just run the included script; if you are on Linux look in the corresponding directory. Without the tcnative jar we fall back to Java SSL and all is running well; remove it to compare. The issue was originally reported here: https://github.com/floragunncom/search-guard/issues/343 |
@Scottmitch @normanmaurer ping |
Looks like java/nio/Bits.java itself calls System.gc() (OpenJDK source), which will, in the presence of the -XX:+DisableExplicitGC flag, just do nothing, and so it appears that the direct buffers did not get garbage-collected fast enough. Curious is that I never hit this before tcnative 2.0.0. Unfortunately, just removing -XX:+DisableExplicitGC is not an option and I am running out of ideas. Relates to JDK-8142537.
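The mechanism described above can be illustrated with the plain JDK, independent of Netty (the class name `DirectBufferDemo` is mine; this is a sketch of the behavior, not the reporter's reproducer):

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // allocateDirect reserves native memory via java.nio.Bits.
        // When the -XX:MaxDirectMemorySize limit is reached, Bits calls
        // System.gc() hoping that Cleaners of dead buffers run; with
        // -XX:+DisableExplicitGC that call does nothing, so the JVM can
        // throw "OutOfMemoryError: Direct buffer memory" even though
        // the old buffers are already unreachable.
        ByteBuffer buf = ByteBuffer.allocateDirect(64);
        System.out.println(buf.isDirect() + " " + buf.capacity()); // prints "true 64"
    }
}
```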
Do you use the unpooled allocator? Also, this sounds a bit like you are not correctly calling ByteBuf.release() all the time. Did you try enabling the leak detector with a higher level? (Sorry, not on the computer ATM.)
|
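The higher leak-detection level mentioned above can be enabled without code changes via a JVM system property (`io.netty.leakDetection.level` and its `paranoid` value are real Netty 4.x settings; the jar name is a placeholder):

```shell
# "paranoid" tracks every allocated ByteBuf and reports leaks with
# full access records -- expensive, so use it only while debugging.
java -Dio.netty.leakDetection.level=paranoid -jar your-app.jar
```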
Leak detector does not report any leaks (tried also in paranoid mode). I tried unpooled and pooled as well, no difference. |
The OpenJDK invokes System.gc() from java.nio.Bits before failing a direct-buffer allocation (see JDK-8142537), so with -XX:+DisableExplicitGC that last-resort collection is suppressed.
Note that there were some memory leaks in netty-tcnative 2.0.0 and 2.0.1. However the known leaks have been fixed in 2.0.3. |
Unsafe is disabled (-Dio.netty.noUnsafe=true), so this evaluates to USE_DIRECT_BUFFER_NO_CLEANER = false and DIRECT_MEMORY_COUNTER = null. I will check the allocator bucket size.

Regarding the overall memory: it worked pretty well with pre-2.0.0 tcnative, so I assume the memory is enough, and I think the bucket size is too, but I will double check.

BTW: the relevant JVM props/flags are (we cannot change them):

```
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+DisableExplicitGC
-XX:+AlwaysPreTouch
-Djna.nosys=true
```

I also checked netty 4.1.13 with tcnative 2.0.5 but nothing changed. |
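A rough paraphrase of the flag evaluation described in the comment above — NOT Netty's actual `PlatformDependent` source, just the consequence it states: with `-Dio.netty.noUnsafe=true` Netty cannot free direct memory explicitly, so it disables the no-cleaner path and keeps no direct-memory counter, leaving reclamation entirely to the buffers' Cleaners, i.e. to the GC:

```java
public class PlatformFlagsSketch {
    // Simplified stand-in for the decision the comment describes.
    static boolean useDirectBufferNoCleaner(boolean noUnsafe) {
        // With Unsafe disabled there is no way to free native memory
        // directly, so the "no cleaner" fast path is off and Netty
        // relies on GC-driven Cleaners instead.
        return !noUnsafe;
    }

    public static void main(String[] args) {
        boolean noUnsafe = true; // -Dio.netty.noUnsafe=true
        System.out.println(useDirectBufferNoCleaner(noUnsafe)); // prints "false"
    }
}
```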
Yip, I see that. Can you ensure your example runs on Linux (I'm more familiar with the diagnostic tools there)? I tried putting the Linux tcnative jar in the same directory as the OSX jar, but I am still getting an exception on startup. Please advise.
|
@Scottmitch just remove the OSX tcnative jar (you must not have two tcnative jars on the classpath) |
I tried this and it doesn't work. I've also tried removing the tcnative jar completely and same error.
|
Sorry, will double check and provide an out-of-the-box working example for Ubuntu |
@Scottmitch here is one which should work OOTB on Ubuntu (as long as openssl and libapr1 are installed). I also have this running on an AWS EC2 instance and I'm happy to mail the SSH key to you if you'd like to test it there. |
I ran some diagnostics with jemalloc/jeprof and I didn't see any smoking gun at the malloc level. It looks like most of the memory sits with malloc. |
I don't know if it's relevant/useful, but I just ran into this with an unrelated project (HTTPS + netty-tcnative) where I thought I could get away with not calling release():

```
An exception 'io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 64 byte(s) of direct memory (used: 2079719423, max: 2079719424)' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 64 byte(s) of direct memory (used: 2079719423, max: 2079719424)
	at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:615) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:569) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledByteBufAllocator.java:169) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.buffer.UnpooledUnsafeDirectByteBuf.<init>(UnpooledUnsafeDirectByteBuf.java:68) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledUnsafeNoCleanerDirectByteBuf.java:25) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledByteBufAllocator.java:164) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(UnpooledByteBufAllocator.java:73) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:181) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:172) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
``` |
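The bookkeeping at the top of the stack trace above (`PlatformDependent.incrementMemoryCounter`) can be paraphrased roughly as follows — this is NOT Netty's code, just a sketch of the check that produces the "used/max" message; the max value is the one from the log:

```java
import java.util.concurrent.atomic.AtomicLong;

public class DirectMemoryCounterSketch {
    static final long MAX_DIRECT_MEMORY = 2079719424L; // from the log above
    static final AtomicLong USED = new AtomicLong();

    // Reserve `bytes` of direct memory, failing if the limit would be
    // exceeded -- the allocation is rolled back before throwing.
    static void reserve(long bytes) {
        long used = USED.addAndGet(bytes);
        if (used > MAX_DIRECT_MEMORY) {
            USED.addAndGet(-bytes);
            throw new OutOfMemoryError("failed to allocate " + bytes
                    + " byte(s) of direct memory (used: " + (used - bytes)
                    + ", max: " + MAX_DIRECT_MEMORY + ")");
        }
    }

    public static void main(String[] args) {
        reserve(64); // fine
        try {
            reserve(MAX_DIRECT_MEMORY); // pushes past the limit
        } catch (OutOfMemoryError e) {
            System.out.println(e.getMessage());
        }
    }
}
```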
A reason why you may not experience the same behavior when using the JDK SSL provider is that the JDK SSL engine doesn't require a copy at the application level to wrap heap buffers [1]. |
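The copy mentioned above can be illustrated with plain NIO: a JNI-backed engine needs its input bytes in native memory, so feeding it from a heap buffer implies an extra direct-buffer allocation (class and variable names here are mine, for illustration only):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class HeapVsDirect {
    public static void main(String[] args) {
        // The JDK SSLEngine can consume this heap buffer as-is...
        ByteBuffer heap = ByteBuffer.wrap("hello".getBytes(StandardCharsets.US_ASCII));

        // ...whereas an OpenSSL engine reached over JNI needs the bytes
        // in native memory, so the data is first copied into a direct
        // buffer -- each such copy counts against MaxDirectMemorySize.
        ByteBuffer direct = ByteBuffer.allocateDirect(heap.remaining());
        direct.put(heap.duplicate()).flip();

        System.out.println(heap.isDirect() + " " + direct.isDirect()); // prints "false true"
    }
}
```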
As stated in #6813 (comment) and https://bugs.openjdk.java.net/browse/JDK-8142537, I don't think you can rely upon the GC to collect direct memory if -XX:+DisableExplicitGC is set. I have not been able to detect any leaks in your reproducer and everything seems to be working "as expected". If you are not able to control the command line arguments, consider using the JDK SSL provider ... however, anything else that may use direct memory to copy data for JNI reasons (e.g. transport-native-epoll, transport-native-kqueue, etc.) will likely run into the same issue. I will close this for now because this appears to be more related to configuration and "expected" JDK behavior (pending https://bugs.openjdk.java.net/browse/JDK-8142537). |
@Scottmitch @floragunncom I'm a bit disappointed this got closed. I am unable to use the native bindings on our ElasticSearch cluster with SearchGuard SSL because it results in Java getting OOM killed within a few minutes of startup. A JRE that normally consumes 27GB of 32GB of RAM quickly runs up to about 36GB (all available swap) and is then killed. We're taking a pretty big perf hit without it, but if the choice is crash or run slowly, I guess we'll just run slowly. |
With ES 5.5.1 (and later) this issue should be resolved, because elastic/elasticsearch#25759 removes -XX:+DisableExplicitGC from the default JVM options. |
@floragunncom Even without DisableExplicitGC it still seems to behave poorly. Running ES 5.5.1, it got OOM-killed while assigning shards, before the cluster even went green (I have about 5TB of data in ~8k shards). I added -XX:MaxDirectMemorySize=512m to try to limit it, but it still seems to leak memory and eventually gets stuck in GC hell. Note the VM size:
I have tried with dynamically linked openssl from netty, your static openssl hosted on bintray, and the static boringssl from netty. Similar results. |
@brandond Which tcnative versions? |
2.0.0, 2.0.1 and 2.0.2 have known memory leaks; 2.0.3 should work, 2.0.4 is a broken release, and 2.0.5 should also work. boringssl may work but we do not test it, so we recommend using our 2.0.5 static openssl build together with ES 5.5.1 and Search Guard 14. I am currently trying to reproduce this on AWS with esrally, but no problems so far. |
Do you have a static OpenSSL build of a known stable release? The only static build I could find on Maven was the BoringSSL build. I'm referencing this document, which it sounds like should be updated to mention that massive memory leaks are to be expected unless you have the right versions of Elasticsearch and netty-tcnative: https://github.com/floragunncom/search-guard-docs/blob/master/tls_openssl.md

Edit: Found your 2.0.5 static build on bintray. Trying now. Anything I should collect to help diagnose the issue? |
We will update the docs soon. See http://dl.bintray.com/floragunncom/netty-tcnative/ |
Result:
jvm.options for es:
|
With ES 5.5.1? And please remove your extra MaxDirectMemorySize setting. |
Here's the startup log, with ES and Java versions.
With netty-tcnative installed, I've never been able to get the cluster to stay up long enough to actually start making any queries. It OOMs while assigning shards. |
Just for the heck of it, I tried adding:
|
Ok, so with Java SSL it basically works? I am currently testing on AWS with the following parameters and it works without any hassle:
|
@Scottmitch @normanmaurer any ideas? |
memory lock and hostname verification are both off. You're running with 30GB heap with 61GB of RAM; I'm running with 20 of 31. I'll try dropping it down to 15GB with no MaxDirectMemorySize and memory_lock on and see if it makes any difference. Edit: memory_lock was on. Trying again with smaller heap size. |
15GB heap and no MaxDirectMemorySize: OOM killed.
|
10GB heap: no OOM kill. Seems to work?
For the record, here's what it looks like without netty-tcnative and 20gb heap:
Appears to essentially double the memory utilization? |
@floragunncom - no ideas and limited cycles at the moment. @brandond - can you provide a reproducer similar to #6813 (comment)? |
Hi.
Is there anything I can do to help solve the problem? The issue forces us to use Java SSL, but openssl is preferable. Thanks, |
We get this error every day:

```
io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
```

Java 11.0.2, started with:

```
-Xms2048m -Xmx2048m -server -verbosegc -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=1024m -Dio.netty.tryReflectionSetAccessible=false -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -server
``` |
why was the issue closed? |
Because it's hard to reproduce under controlled circumstances and easy to work around by just not using the tcnative openssl bindings. |
Expected behavior
No java.lang.OutOfMemoryError

Actual behavior

Steps to reproduce

Minimal yet complete reproducer code (or URL to code)

Netty version
4.1.11 with tcnative 2.0.1.Final

JVM version (e.g. java -version)
1.8.0_131

OS version (e.g. uname -a)
Ubuntu