
java.lang.OutOfMemoryError: Direct buffer memory (with tcnative) #6813

Closed
floragunn opened this issue Jun 4, 2017 · 46 comments
@floragunn
Contributor

Expected behavior

No java.lang.OutOfMemoryError

Actual behavior

[2017-06-02T11:00:00,021][WARN ][c.f.s.h.SearchGuardHttpServerTransport] [GtfsfZB] caught exception while handling client http traffic, closing connection [id: 0x23038e55, L:/172.16.0.1:9200 - R:/172.16.0.1:40464]
java.lang.OutOfMemoryError: Direct buffer memory
	at java.nio.Bits.reserveMemory(Bits.java:693) ~[?:1.8.0_131]
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[?:1.8.0_131]
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[?:1.8.0_131]
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:146) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]

Steps to reproduce

Minimal yet complete reproducer code (or URL to code)

Netty version

4.1.11 with tcnative 2.0.1.Final

JVM version (e.g. java -version)

1.8.0_131

OS version (e.g. uname -a)

Ubuntu

@floragunn
Contributor Author

Maybe related to #6789.

@rkapsi
Member

rkapsi commented Jun 4, 2017

@floragunncom does your Netty app create and destroy a lot of SslContext instances? The leak in #6789 is a bit of native memory that gets allocated when an SslContext is created and is not freed when the SslContext is garbage-collected by the JVM.

@floragunn
Contributor Author

No, it's not creating a lot of SslContext instances, so this seems unrelated to #6789.

@Scottmitch
Member

Scottmitch commented Jun 5, 2017

  • Can you provide a minimal/complete reproducer?
  • Can you hook up ByteBufAllocatorMetric and describe the memory consumption over time (e.g. every X seconds)? (See the sketch after this list.)
  • Do you set io.netty.maxDirectMemory or -XX:MaxDirectMemorySize and if so to what value(s)?
  • Did your application change at all?
  • Do you observe this behavior with older versions of Netty/tcnative?
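
To make the second point concrete, here is a minimal sketch of such metrics logging, assuming the default pooled allocator is in use; the class name and the 10-second interval are placeholders:

import io.netty.buffer.ByteBufAllocatorMetric;
import io.netty.buffer.PooledByteBufAllocator;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class AllocatorMetricsLogger {
    public static void start() {
        // PooledByteBufAllocator implements ByteBufAllocatorMetricProvider
        ByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Log pooled direct/heap usage periodically to spot monotonic growth
        scheduler.scheduleAtFixedRate(
                () -> System.out.printf("direct=%d bytes, heap=%d bytes%n",
                        metric.usedDirectMemory(), metric.usedHeapMemory()),
                0, 10, TimeUnit.SECONDS);
    }
}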

@normanmaurer
Member

Also, did you "stop" writing when the Channel became non-writable?
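
For context, a minimal sketch of a handler that respects channel writability; haveMoreData() and nextChunk() are hypothetical placeholders for the application's data source:

import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class BackpressureAwareHandler extends ChannelInboundHandlerAdapter {
    private void writeIfPossible(Channel ch) {
        // Only keep writing while the outbound buffer is below its high watermark;
        // otherwise queued writes pile up in direct memory.
        while (ch.isWritable() && haveMoreData()) {
            ch.writeAndFlush(nextChunk());
        }
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        writeIfPossible(ctx.channel()); // resume once the channel drains
        ctx.fireChannelWritabilityChanged();
    }

    private boolean haveMoreData() { return false; } // placeholder
    private Object nextChunk() { return null; }      // placeholder
}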

@floragunn
Contributor Author

The error is still there with 4.1.12 and tcnative 2.0.3. I will start investigating this in more depth now. It's definitely related to the amount of data transferred: for small datasets it works well, but as the amount of data increases it fails.

@floragunn
Contributor Author

floragunn commented Jun 28, 2017

It seems -XX:+DisableExplicitGC is causing the trouble (https://stackoverflow.com/questions/32912702/impact-of-setting-xxdisableexplicitgc-when-nio-direct-buffers-are-used) ... will report back soon.

@floragunn
Contributor Author

After removing -XX:+DisableExplicitGC from the command line flags, the java.lang.OutOfMemoryError: Direct buffer memory seems to be gone.

But unfortunately this is not really an option for us, because in production we have no control over the JVM flags and so we have to deal with -XX:+DisableExplicitGC.

io.netty.maxDirectMemory is not set explicitly, and -XX:MaxDirectMemorySize is the same as -Xmx (in our case normally somewhere between 4 and 64 GB).

It looks like the problem was introduced in Netty 4.1.8 or 4.1.9, because 4.1.7 was reported stable. Unfortunately I am not able to create a minimal reproducer, but I will assemble something that demonstrates the problem.

Running Netty without OpenSSL (using Java SSL) works well for all versions and circumstances.

@floragunn
Contributor Author

Please download https://bintray.com/floragunncom/files/download_file?file_path=netty%2Fnetty-6813-1.tar.gz and extract it.

If you are on OSX, just run bin/elasticsearch, wait a few seconds, then run ./load_sampledata.sh in another terminal until you see a java.lang.OutOfMemoryError: Direct buffer memory in the first console (normally after one or two runs).

If you are on Linux, look in plugins/search-guard-5 and replace the tcnative jar with the one for Linux.

Without the tcnative jar we fall back to Java SSL and everything runs well. Remove -XX:+DisableExplicitGC from config/jvm.options and the OutOfMemoryError should not show up. If you increase -Xmx or -XX:MaxDirectMemorySize, run ./load_sampledata.sh just a few more times and you will see the error and the JVM dies.

The issue was originally reported here: https://github.com/floragunncom/search-guard/issues/343

@floragunn
Contributor Author

@Scottmitch @normanmaurer ping

@floragunn
Contributor Author

Looks like java/nio/Bits.java itself calls System.gc() (OpenJDK source), which in the presence of a -XX:+DisableExplicitGC flag does nothing, so it appears that the direct buffers do not get garbage collected fast enough.

Curiously, I never hit this before tcnative 2.0.0. Unfortunately, just removing -XX:+DisableExplicitGC is not an option, and I am running out of ideas.

Relates to JDK-8142537
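
To illustrate the mechanism, a minimal standalone sketch (a hypothetical class, not taken from the reproducer) that tends to hit the same OOME once explicit GC is disabled:

// Run with: java -XX:MaxDirectMemorySize=64m -XX:+DisableExplicitGC DirectBufferOome
// java.nio.Bits.reserveMemory() normally calls System.gc() to nudge the JVM into
// freeing unreachable direct buffers; with -XX:+DisableExplicitGC that call is a
// no-op, so the reservation limit can be exhausted before buffers are reclaimed.
import java.nio.ByteBuffer;

public class DirectBufferOome {
    public static void main(String[] args) {
        while (true) {
            ByteBuffer.allocateDirect(1024 * 1024); // becomes garbage immediately
        }
    }
}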

@normanmaurer
Member

normanmaurer commented Jul 11, 2017 via email

@floragunn
Contributor Author

Leak detector does not report any leaks (also tried paranoid mode). I tried unpooled and pooled as well; no difference.

@Scottmitch
Member

The OpenJDK invokes System.gc() during direct ByteBuffer allocation to provide a hint, in the hope of timely reclamation of direct memory by the GC (because these allocations are off-heap). Using a properly tuned pooled allocator should reduce/remove the need to allocate via ByteBuffer, and we even try to bypass the ByteBuffer allocation path by default for performance reasons (controlled via io.netty.maxDirectMemory [1]). It sounds like this occurs during a burst of activity ... are you tuning the pooled allocator bucket sizes for your application, and are you sure you have allowed enough memory for the increased load of traffic (how did you calculate the max)?

Curious is that i never hit this before tcnative 2.0.0

Note that there were some memory leaks in netty-tcnative 2.0.0 and 2.0.1. However the known leaks have been fixed in 2.0.3.

[1] https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L149
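
For illustration, a minimal sketch of installing an explicitly sized pooled allocator on a server bootstrap, as mentioned above; the sizing numbers are made-up placeholders, not recommendations:

import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.channel.ChannelOption;

public final class AllocatorTuning {
    public static ServerBootstrap configure(ServerBootstrap b) {
        // preferDirect, nHeapArena, nDirectArena, pageSize, maxOrder
        // (chunk size = pageSize << maxOrder = 8 KiB << 9 = 4 MiB here)
        PooledByteBufAllocator alloc =
                new PooledByteBufAllocator(true, 2, 2, 8192, 9);
        return b.option(ChannelOption.ALLOCATOR, alloc)
                .childOption(ChannelOption.ALLOCATOR, alloc);
    }
}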

@floragunn
Contributor Author

Unsafe is disabled (-Dio.netty.noUnsafe=true), so this evaluates to

USE_DIRECT_BUFFER_NO_CLEANER = false;
DIRECT_MEMORY_COUNTER = null;

and so it looks like io.netty.maxDirectMemory therefore has no effect.

I will check the allocator bucket sizes. Regarding the overall memory: it worked pretty well pre-tcnative 2.0.0, so I assume the memory is enough, and I think the bucket sizes are too, but I will double-check.

BTW: The relevant JVM props/flags are (we cannot change them):

-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+DisableExplicitGC
-XX:+AlwaysPreTouch
-Djna.nosys=true

I also checked netty 4.1.13 with tcnative 2.0.5 but nothing changed.

@Scottmitch
Member

Yip, I see that. Can you ensure your example runs on Linux (I'm more familiar with the diagnostic tools there)? I tried putting the Linux tcnative jar in the same directory as the OSX jar, but I am still getting an exception when starting. Please advise.

netty-6813$ ls -al ./plugins/search-guard-5/netty-tcnative*
-rw-rw-r-- 1 scott scott 62745 Jul 11 17:17 ./plugins/search-guard-5/netty-tcnative-2.0.3.Final-linux-x86_64.jar
-rw-r--r-- 1 scott scott 49579 Jun  7 09:19 ./plugins/search-guard-5/netty-tcnative-2.0.3.Final-osx-x86_64.jar
scott@scott-lubuntu:~/Downloads/netty-6813/bin$ ./elasticsearch
[2017-07-11T17:21:43,918][INFO ][o.e.n.Node               ] [] initializing ...
[2017-07-11T17:21:43,969][INFO ][o.e.e.NodeEnvironment    ] [TxbF8jB] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [34.5gb], net total_space [51.4gb], spins? [no], types [ext4]
[2017-07-11T17:21:43,970][INFO ][o.e.e.NodeEnvironment    ] [TxbF8jB] heap size [990.7mb], compressed ordinary object pointers [true]
[2017-07-11T17:21:43,974][INFO ][o.e.n.Node               ] node name [TxbF8jB] derived from node ID [TxbF8jB7TreNsJf4FMWILw]; set [node.name] to override
[2017-07-11T17:21:43,974][INFO ][o.e.n.Node               ] version[5.4.2], pid[7153], build[929b078/2017-06-15T02:29:28.122Z], OS[Linux/4.4.0-83-generic/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_112/25.112-b15]
[2017-07-11T17:21:43,975][INFO ][o.e.n.Node               ] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+DisableExplicitGC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.path.home=/home/scott/Downloads/netty-6813]
[2017-07-11T17:21:44,442][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to load plugin - Plugin information:
Name: search-guard-5
Description: Provide access control related features for Elasticsearch 5
Version: 5.4.2-13
Native Controller: false
 * Classname: com.floragunn.searchguard.SearchGuardPlugin due to jar hell
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:127) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.cli.Command.main(Command.java:88) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) ~[elasticsearch-5.4.2.jar:5.4.2]
Caused by: java.lang.IllegalStateException: failed to load plugin - Plugin information:
Name: search-guard-5
Description: Provide access control related features for Elasticsearch 5
Version: 5.4.2-13
Native Controller: false
 * Classname: com.floragunn.searchguard.SearchGuardPlugin due to jar hell
	at org.elasticsearch.plugins.PluginsService.loadBundles(PluginsService.java:372) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:139) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.node.Node.<init>(Node.java:310) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.node.Node.<init>(Node.java:242) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:232) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:232) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:350) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) ~[elasticsearch-5.4.2.jar:5.4.2]
	... 6 more
Caused by: java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method) ~[?:1.8.0_112]
	at java.util.zip.ZipFile.<init>(ZipFile.java:219) ~[?:1.8.0_112]
	at java.util.zip.ZipFile.<init>(ZipFile.java:149) ~[?:1.8.0_112]
	at java.util.jar.JarFile.<init>(JarFile.java:166) ~[?:1.8.0_112]
	at java.util.jar.JarFile.<init>(JarFile.java:103) ~[?:1.8.0_112]
	at org.elasticsearch.bootstrap.JarHell.checkJarHell(JarHell.java:180) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.plugins.PluginsService.loadBundles(PluginsService.java:370) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:139) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.node.Node.<init>(Node.java:310) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.node.Node.<init>(Node.java:242) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:232) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:232) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:350) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) ~[elasticsearch-5.4.2.jar:5.4.2]
	... 6 more

@floragunn
Contributor Author

@Scottmitch just remove the OSX tcnative jar (you must not have two tcnative jars on the classpath).

@Scottmitch
Member

Scottmitch commented Jul 13, 2017

I tried this and it doesn't work. I've also tried removing the tcnative jar completely and got the same error.

Ubuntu 16.10
4.4.0-83-generic #106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

@floragunn
Contributor Author

Sorry, I will double-check and provide an out-of-the-box working example for Ubuntu.

@floragunn
Contributor Author

@Scottmitch here is one which should work OOTB on Ubuntu (as long as openssl and libapr1 are installed):
https://bintray.com/floragunncom/files/download_file?file_path=n3%2Fnetty-6813-ubuntu3.tar.gz

I also have this running on an AWS EC2 instance, and I'm happy to mail the SSH key to you if you would like to test it there.

@Scottmitch
Member

I ran some diagnostics with jemalloc/jeprof and didn't see any smoking gun at the malloc level. It looks like most of the memory sits with malloc; updatewindow, init, and readCEN retain some memory too, but that memory doesn't appear to change as -XX:MaxDirectMemorySize is increased. As you said in #6813 (comment), you are disabling Unsafe, so all the direct memory should be managed by the GC (even if there are ByteBuf leaks, the memory should be reclaimed ... eventually). I need to look at the heap dumps in more detail to see where memory is being retained...

@rkapsi
Member

rkapsi commented Jul 17, 2017

I don't know if it's relevant/useful, but we happen to use -XX:+UseConcMarkSweepGC as well, and in my experience it's very easy to make it OOME with direct memory. It's sometimes not compelled to collect.

I just ran into it with an unrelated project (HTTPS + netty-tcnative) where I thought I could get away with not calling #release().

An exception 'io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 64 byte(s) of direct memory (used: 2079719423, max: 2079719424)' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 64 byte(s) of direct memory (used: 2079719423, max: 2079719424)
    at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:615) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:569) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledByteBufAllocator.java:169) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledUnsafeDirectByteBuf.<init>(UnpooledUnsafeDirectByteBuf.java:68) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledUnsafeNoCleanerDirectByteBuf.java:25) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledByteBufAllocator.java:164) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(UnpooledByteBufAllocator.java:73) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:181) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:172) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
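
For reference, the usual fix for that kind of OOME is to make sure the last handler that consumes a ByteBuf releases it; a minimal sketch using SimpleChannelInboundHandler, which releases automatically:

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;

// SimpleChannelInboundHandler releases the message automatically after
// channelRead0 returns, so the direct buffer is returned to the allocator.
public class ConsumingHandler extends SimpleChannelInboundHandler<ByteBuf> {
    @Override
    protected void channelRead0(ChannelHandlerContext ctx, ByteBuf msg) {
        // process msg here; do NOT retain it unless you take ownership
        System.out.println("read " + msg.readableBytes() + " bytes");
    }
}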

@Scottmitch
Member

A reason why you may not experience the same behavior when using the JDK SSL provider is that the JDK SSL engine doesn't require a copy at the application level to wrap heap buffers [1].

[1] https://github.com/netty/netty/blob/4.1/handler/src/main/java/io/netty/handler/ssl/SslHandler.java#L922

@Scottmitch
Member

As stated in #6813 (comment) and https://bugs.openjdk.java.net/browse/JDK-8142537, I don't think you can rely upon the GC to collect direct memory if -XX:+DisableExplicitGC is enabled. Relying upon ByteBuffer.allocateDirect's memory enforcement, which depends upon explicit GC hints, is defeated by disabling those hints. Unless the JDK is updated to ignore -XX:+DisableExplicitGC for direct memory allocation, I think OOME is expected behavior from the JDK.

I have not been able to detect any leaks in your reproducer and everything seems to be working "as expected". If you are not able to control the command line arguments, consider using the JDK SSL provider ... however, if there is anywhere else that may use direct memory to copy data for JNI reasons (e.g. transport-native-epoll, transport-native-kqueue, etc...), you will likely run into the same issue. I will close this for now because it appears to be more related to configuration and "expected" JDK behavior (pending https://bugs.openjdk.java.net/browse/JDK-8142537).
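
For completeness, a minimal sketch of forcing the JDK provider when building the SslContext; the certificate and key paths are placeholders:

import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;
import java.io.File;

public final class JdkSslContextFactory {
    public static SslContext create() throws Exception {
        // Explicitly pin the JDK engine instead of tcnative/OpenSSL.
        return SslContextBuilder
                .forServer(new File("cert.pem"), new File("key.pem")) // placeholder paths
                .sslProvider(SslProvider.JDK)
                .build();
    }
}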

@brandond

@Scottmitch @floragunncom I'm a bit disappointed this got closed. I am unable to use the native bindings on our Elasticsearch cluster with Search Guard SSL because Java gets OOM-killed within a few minutes of startup: a JRE that normally consumes 27 GB of 32 GB of RAM quickly runs up to about 36 GB (all available swap) and is then killed. We're taking a pretty big perf hit without it, but if the choice is crash or run slowly, I guess we'll just run slowly.

@floragunn
Contributor Author

With ES 5.5.1 (and later) this issue should be resolved, because elastic/elasticsearch#25759 removes -XX:+DisableExplicitGC from the ES default config (commit details). If you are on an earlier version, just remove -XX:+DisableExplicitGC manually from jvm.options.

@brandond

brandond commented Jul 31, 2017

@floragunncom Even without DisableExplicitGC it still seems to behave poorly. Running ES 5.5.1, it got OOM-killed while assigning shards, before the cluster even went green (I have about 5 TB of data in ~8k shards). I added MaxDirectMemorySize=512m to try to limit it, but it still seems to leak memory and eventually gets stuck in GC hell. Note the VM size:

top - 21:49:58 up 2 days, 13:12,  1 user,  load average: 2.69, 4.16, 2.77
Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.2%us,  0.0%sy,  0.0%ni, 74.5%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  31401100k total, 31167608k used,   233492k free,   212284k buffers
Swap:        0k total,        0k used,        0k free,  7102328k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3961 elastics  20   0  986g  26g 5.3g S 99.9 89.2  10:54.72 java

I have tried the dynamically linked openssl from netty, your static openssl hosted on bintray, and the static boringssl from netty. Similar results.

@floragunn
Contributor Author

@brandond Which tcnative versions?

@floragunn
Contributor Author

2.0.0, 2.0.1 and 2.0.2 have known memory leaks; 2.0.3 should work; 2.0.4 is a broken release; 2.0.5 should also work. boringssl may work, but we do not test it.

So we recommend using our 2.0.5 static openssl build together with ES 5.5.1 and Search Guard 14. I am currently trying to reproduce this on AWS with esrally, but no problems so far.

@brandond

brandond commented Aug 1, 2017

Do you have a static OpenSSL build of a known stable release? The only static build I could find on Maven was the BoringSSL build. I'm referencing this document, which it sounds like should be updated to mention that massive memory leaks are to be expected unless you have the right versions of Elasticsearch and netty-tcnative.

https://github.com/floragunncom/search-guard-docs/blob/master/tls_openssl.md

Edit: Found your 2.0.5 static build on bintray. Trying it now. Anything I should collect to help diagnose the issue?

@floragunn
Contributor Author

We will update the docs soon. See http://dl.bintray.com/floragunncom/netty-tcnative/

@brandond

brandond commented Aug 1, 2017

Result:

[304057.604728] Out of memory: Kill process 9162 (java) score 971 or sacrifice child
[304057.608374] Killed process 9162 (java) total-vm:35956548kB, anon-rss:30360260kB, file-rss:124132kB, shmem-rss:0kB
[304058.094253] oom_reaper: reaped process 9162 (java), now anon-rss:21584260kB, file-rss:124536kB, shmem-rss:0kB

jvm.options for es:

-Xms20442m
-Xmx20442m
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+AlwaysPreTouch
-server
-Xss1m
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Djdk.io.permissionsUseCanonicalPath=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError

@floragunn
Contributor Author

floragunn commented Aug 1, 2017

With ES 5.5.1?
What is your Java version and vendor?
Do you make big bulk requests? (I just read that it happens even before the cluster is green.)

And please remove your extra MaxDirectMemorySize setting.

@brandond

brandond commented Aug 1, 2017

Here's the startup log, with ES and Java versions.

[2017-08-01T21:04:03,826][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] initializing ...
[2017-08-01T21:04:03,908][INFO ][o.e.e.NodeEnvironment    ] [i-04b5958c79e89bd20] using [1] data paths, mounts [[/mnt/elasticsearch (/dev/xvdb)]], net us
able_space [947.7gb], net total_space [1.9tb], spins? [no], types [ext4]
[2017-08-01T21:04:03,908][INFO ][o.e.e.NodeEnvironment    ] [i-04b5958c79e89bd20] heap size [19.9gb], compressed ordinary object pointers [true]
[2017-08-01T21:04:04,901][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] node name [i-04b5958c79e89bd20], node ID [00QQBdGZSXeAdf8xsH9R5g]
[2017-08-01T21:04:04,902][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] version[5.5.1], pid[9162], build[19c13d0/2017-07-18T20:44:24.823Z], OS[Linux/4.9.38-16.33.amzn1.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_141/25.141-b16]
[2017-08-01T21:04:04,902][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] JVM arguments [-Xms20442m, -Xmx20442m, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.path.home=/usr/share/elasticsearch]
[2017-08-01T21:04:05,965][INFO ][c.f.s.s.SearchGuardSSLPlugin] Consider setting -Djdk.tls.rejectClientInitiatedRenegotiation=true to prevent DoS attacks through client side initiated TLS renegotiation.
[2017-08-01T21:04:06,321][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] Open SSL OpenSSL 1.0.2l  25 May 2017 (268443855) available
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.version: 1.8.0_141
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.specification.version: 1.8
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.specification.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.specification.name: Java Virtual Machine Specification
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.name: OpenJDK 64-Bit Server VM
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.specification.version: 1.8
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.specification.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.specification.name: Java Platform API Specification
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] os.name: Linux
[2017-08-01T21:04:06,323][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] os.arch: amd64
[2017-08-01T21:04:06,323][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] os.version: 4.9.38-16.33.amzn1.x86_64

With netty-tcnative installed, I've never been able to get the cluster to stay up long enough to actually start making any queries. It OOMs while assigning shards.

@brandond

brandond commented Aug 1, 2017

Just for the heck of it, I tried adding -XX:MaxDirectMemorySize=512m to jvm.options. If I do this, it does not immediately get OOM-killed. It does, however, chew through a ton of VM and eventually ends up stalled in GC hell before I have to shut it down manually.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12254 elastics  20   0  544g  25g 4.5g S 61.9 86.7   6:21.11 java

@floragunn
Contributor Author

OK, so it basically works with Java SSL?
Have you set "bootstrap.memory_lock: true" in elasticsearch.yml?
Have you disabled hostname verification?

I am currently testing on AWS with the following parameters and it works without any hassle:

  • 3 x r3.2xlarge, 30 GB heap, Ubuntu 16.04 (Kernel 4.4.0)
  • ES 5.5.1, SG 14, Oracle JVM 1.8.0_131/25.131-b11, memory_lock true
  • max file desc = 1000000
  • Instance SSD storage
  • OpenSSL 1.0.2g, tcnative 2.0.5 dynamic compiled version from netty maven repo
  • No hostname verification at all enabled
  • HTTPS enabled
  • 120 million docs, 116 shards, 20 GB total size

@floragunn
Contributor Author

@Scottmitch @normanmaurer any ideas?

@brandond

brandond commented Aug 1, 2017

memory lock and hostname verification are both off. You're running with a 30 GB heap on 61 GB of RAM; I'm running with 20 of 31. I'll try dropping it down to 15 GB with no MaxDirectMemorySize and memory_lock on and see if it makes any difference.

Edit: memory_lock was on. Trying again with smaller heap size.

@brandond

brandond commented Aug 1, 2017

15GB heap and no MaxDirectMemorySize: OOM killed.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15289 elastics  20   0 34.0g  28g 126m S 145.7 95.3   3:21.36 /usr/bin/java -Xms15332m -Xmx15332m

@brandond

brandond commented Aug 1, 2017

10GB heap: no OOM kill. Seems to work?

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16678 elastics  20   0  617g  19g 9.1g S 84.2 66.6   5:32.69 /usr/bin/java -Xms10221m -Xmx10221m

For the record, here's what it looks like without netty-tcnative and a 20 GB heap:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18327 elastics  20   0 1065g  26g 5.6g S 17.3 88.4   9:50.35 /usr/bin/java -Xms20442m -Xmx20442m 

Appears to essentially double the memory utilization?

@Scottmitch
Member

@floragunncom - no ideas and limited cycles at the moment.

@brandond - can you provide a reproducer similar to #6813 (comment)?

@jansimak

jansimak commented Dec 8, 2017

Hi.
I can confirm it's still happening. During heavy bulk indexing or searching, the heap goes up and never drops back.
Environment: heap size 30GB, jvm without -XX:+DisableExplicitGC, openjdk-8-jre-headless=8u141-b15-1~deb9u1, debian stretch, elasticsearch=5.6.0, netty-tcnative-openssl-1.0.2m-static-2.0.7

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
6147 elastic+  20   0  0.137t 0.078t 146116 S  22.2 84.1  26:09.21 java

Is there anything I can do to help solve the problem? The issue forces us to use Java SSL, but openssl is preferable.

Thanks,
Honza

@fbasar

fbasar commented May 6, 2019

io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.OutOfMemoryError: Direct buffer memory
at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:769)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:745)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:332)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176)
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:137)
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:147)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)

It gives this error every day.

Java 11.0.2
Netty 4.1.35.Final
TCNative 2.0.25.Final

-Xms2048m -Xmx2048m -server -verbosegc -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=1024m -Dio.netty.tryReflectionSetAccessible=false -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -server

@mrkk

mrkk commented May 29, 2019

Why was the issue closed?

@brandond

Because it's hard to reproduce under controlled circumstances, and easy to work around by just not using the tcnative openssl bindings.
