
java.lang.OutOfMemoryError: Direct buffer memory (with tcnative) #6813

Closed
floragunn opened this issue Jun 4, 2017 · 46 comments
@floragunn
Contributor

Expected behavior

No java.lang.OutOfMemoryError

Actual behavior

[2017-06-02T11:00:00,021][WARN ][c.f.s.h.SearchGuardHttpServerTransport] [GtfsfZB] caught exception while handling client http traffic, closing connection [id: 0x23038e55, L:/172.16.0.1:9200 - R:/172.16.0.1:40464]
java.lang.OutOfMemoryError: Direct buffer memory
	at java.nio.Bits.reserveMemory(Bits.java:693) ~[?:1.8.0_131]
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[?:1.8.0_131]
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[?:1.8.0_131]
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:146) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]

Steps to reproduce

Minimal yet complete reproducer code (or URL to code)

Netty version

4.1.11 with tcnative 2.0.1.Final

JVM version (e.g. java -version)

1.8.0_131

OS version (e.g. uname -a)

Ubuntu

@floragunn
Contributor Author

Maybe related to #6789.

@rkapsi
Member

rkapsi commented Jun 4, 2017

@floragunncom does your Netty app create and destroy a lot of SslContext instances? The leak in #6789 is a bit of native memory that gets allocated when an SslContext is created and is not freed when the SslContext is garbage-collected by the JVM.

@floragunn
Contributor Author

No, it's not creating a lot of SslContext instances, so this seems unrelated to #6789.

@Scottmitch
Member

Scottmitch commented Jun 5, 2017

  • Can you provide a minimal/complete reproducer?
  • Can you hook up ByteBufAllocatorMetric and describe the memory consumption over time (e.g. every X seconds)? (See the sketch after this list.)
  • Do you set io.netty.maxDirectMemory or -XX:MaxDirectMemorySize and if so to what value(s)?
  • Did your application change at all?
  • Do you observe this behavior with older versions of Netty/tcnative?
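
To make the second point concrete, here is a minimal sketch of such metrics logging, assuming the default pooled allocator is in use; the class name and the 10-second interval are placeholders:

import io.netty.buffer.ByteBufAllocatorMetric;
import io.netty.buffer.PooledByteBufAllocator;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class AllocatorMetricsLogger {
    public static void start() {
        // PooledByteBufAllocator implements ByteBufAllocatorMetricProvider
        ByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Log pooled direct/heap usage periodically to spot monotonic growth
        scheduler.scheduleAtFixedRate(
                () -> System.out.printf("direct=%d bytes, heap=%d bytes%n",
                        metric.usedDirectMemory(), metric.usedHeapMemory()),
                0, 10, TimeUnit.SECONDS);
    }
}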

@normanmaurer
Member

Also, did you "stop" writing when the Channel became non-writable?
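
For context, a minimal sketch of a handler that respects channel writability; haveMoreData() and nextChunk() are hypothetical placeholders for the application's data source:

import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class BackpressureAwareHandler extends ChannelInboundHandlerAdapter {
    private void writeIfPossible(Channel ch) {
        // Only keep writing while the outbound buffer is below its high watermark;
        // otherwise queued writes pile up in direct memory.
        while (ch.isWritable() && haveMoreData()) {
            ch.writeAndFlush(nextChunk());
        }
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) {
        writeIfPossible(ctx.channel()); // resume once the channel drains
        ctx.fireChannelWritabilityChanged();
    }

    private boolean haveMoreData() { return false; } // placeholder
    private Object nextChunk() { return null; }      // placeholder
}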

@floragunn
Contributor Author

The error is still there with 4.1.12 and tcnative 2.0.3. I will start investigating this in more depth now. It's definitely related to the amount of data transferred: for small datasets it works well, but as the amount of data increases it fails.

@floragunn
Contributor Author

floragunn commented Jun 28, 2017

It seems -XX:+DisableExplicitGC is causing the trouble (https://stackoverflow.com/questions/32912702/impact-of-setting-xxdisableexplicitgc-when-nio-direct-buffers-are-used) ... will report back soon.

@floragunn
Contributor Author

After removing -XX:+DisableExplicitGC from the command line flags, the java.lang.OutOfMemoryError: Direct buffer memory seems to be gone.

But unfortunately this is not really an option for us, because in production we have no control over the JVM flags and so we have to deal with -XX:+DisableExplicitGC.

io.netty.maxDirectMemory is not set explicitly, and -XX:MaxDirectMemorySize is the same as -Xmx (in our case normally somewhere between 4 and 64 GB).

It looks like the problem was introduced in Netty 4.1.8 or 4.1.9, because 4.1.7 was reported stable. Unfortunately I am not able to create a minimal reproducer, but I will assemble something that demonstrates the problem.

Running Netty without OpenSSL (using Java SSL) works well for all versions and circumstances.

@floragunn
Contributor Author

Please download https://bintray.com/floragunncom/files/download_file?file_path=netty%2Fnetty-6813-1.tar.gz and extract it.

If you are on OSX, just run bin/elasticsearch, wait a few seconds, then run ./load_sampledata.sh in another terminal until you see a java.lang.OutOfMemoryError: Direct buffer memory in the first console (normally after one or two runs).

If you are on Linux, look in plugins/search-guard-5 and replace the tcnative jar with the one for Linux.

Without the tcnative jar we fall back to Java SSL and everything runs well. Remove -XX:+DisableExplicitGC from config/jvm.options and the OutOfMemoryError should not show up. If you increase -Xmx or -XX:MaxDirectMemorySize, run ./load_sampledata.sh just a few more times and you will see the error and the JVM dies.

The issue was originally reported here: https://github.com/floragunncom/search-guard/issues/343

@floragunn
Contributor Author

@Scottmitch @normanmaurer ping

@floragunn
Contributor Author

Looks like java/nio/Bits.java itself calls System.gc() (OpenJDK source), which in the presence of a -XX:+DisableExplicitGC flag does nothing, so it appears that the direct buffers do not get garbage collected fast enough.

Curiously, I never hit this before tcnative 2.0.0. Unfortunately, just removing -XX:+DisableExplicitGC is not an option, and I am running out of ideas.

Relates to JDK-8142537
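
To illustrate the mechanism, a minimal standalone sketch (a hypothetical class, not taken from the reproducer) that tends to hit the same OOME once explicit GC is disabled:

// Run with: java -XX:MaxDirectMemorySize=64m -XX:+DisableExplicitGC DirectBufferOome
// java.nio.Bits.reserveMemory() normally calls System.gc() to nudge the JVM into
// freeing unreachable direct buffers; with -XX:+DisableExplicitGC that call is a
// no-op, so the reservation limit can be exhausted before buffers are reclaimed.
import java.nio.ByteBuffer;

public class DirectBufferOome {
    public static void main(String[] args) {
        while (true) {
            ByteBuffer.allocateDirect(1024 * 1024); // becomes garbage immediately
        }
    }
}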

@normanmaurer
Member

normanmaurer commented Jul 11, 2017 via email

@floragunn
Contributor Author

Leak detector does not report any leaks (also tried paranoid mode). I tried unpooled and pooled as well; no difference.

@Scottmitch
Member

The OpenJDK invokes System.gc() during direct ByteBuffer allocation to provide a hint, in the hope of timely reclamation of direct memory by the GC (because these allocations are off-heap). Using a properly tuned pooled allocator should reduce/remove the need to allocate via ByteBuffer, and we even try to bypass the ByteBuffer allocation path by default for performance reasons (controlled via io.netty.maxDirectMemory [1]). It sounds like this occurs during a burst of activity ... are you tuning the pooled allocator bucket sizes for your application, and are you sure you have allowed enough memory for the increased load of traffic (how did you calculate the max)?

Curious is that i never hit this before tcnative 2.0.0

Note that there were some memory leaks in netty-tcnative 2.0.0 and 2.0.1. However the known leaks have been fixed in 2.0.3.

[1] https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L149
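
For illustration, a minimal sketch of installing an explicitly sized pooled allocator on a server bootstrap, as mentioned above; the sizing numbers are made-up placeholders, not recommendations:

import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.channel.ChannelOption;

public final class AllocatorTuning {
    public static ServerBootstrap configure(ServerBootstrap b) {
        // preferDirect, nHeapArena, nDirectArena, pageSize, maxOrder
        // (chunk size = pageSize << maxOrder = 8 KiB << 9 = 4 MiB here)
        PooledByteBufAllocator alloc =
                new PooledByteBufAllocator(true, 2, 2, 8192, 9);
        return b.option(ChannelOption.ALLOCATOR, alloc)
                .childOption(ChannelOption.ALLOCATOR, alloc);
    }
}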

@floragunn
Contributor Author

Unsafe is disabled (-Dio.netty.noUnsafe=true), so this evaluates to

USE_DIRECT_BUFFER_NO_CLEANER = false;
DIRECT_MEMORY_COUNTER = null;

and so it looks like io.netty.maxDirectMemory therefore has no effect.

I will check the allocator bucket sizes. Regarding the overall memory: it worked pretty well pre-tcnative 2.0.0, so I assume the memory is enough, and I think the bucket sizes are too, but I will double-check.

BTW: The relevant JVM props/flags are (we cannot change them):

-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+DisableExplicitGC
-XX:+AlwaysPreTouch
-Djna.nosys=true

I also checked netty 4.1.13 with tcnative 2.0.5 but nothing changed.

@Scottmitch
Member

Yip, I see that. Can you ensure your example runs on Linux (I'm more familiar with the diagnostic tools there)? I tried putting the Linux tcnative jar in the same directory as the OSX jar, but I am still getting an exception when starting. Please advise.

netty-6813$ ls -al ./plugins/search-guard-5/netty-tcnative*
-rw-rw-r-- 1 scott scott 62745 Jul 11 17:17 ./plugins/search-guard-5/netty-tcnative-2.0.3.Final-linux-x86_64.jar
-rw-r--r-- 1 scott scott 49579 Jun  7 09:19 ./plugins/search-guard-5/netty-tcnative-2.0.3.Final-osx-x86_64.jar
scott@scott-lubuntu:~/Downloads/netty-6813/bin$ ./elasticsearch
[2017-07-11T17:21:43,918][INFO ][o.e.n.Node               ] [] initializing ...
[2017-07-11T17:21:43,969][INFO ][o.e.e.NodeEnvironment    ] [TxbF8jB] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [34.5gb], net total_space [51.4gb], spins? [no], types [ext4]
[2017-07-11T17:21:43,970][INFO ][o.e.e.NodeEnvironment    ] [TxbF8jB] heap size [990.7mb], compressed ordinary object pointers [true]
[2017-07-11T17:21:43,974][INFO ][o.e.n.Node               ] node name [TxbF8jB] derived from node ID [TxbF8jB7TreNsJf4FMWILw]; set [node.name] to override
[2017-07-11T17:21:43,974][INFO ][o.e.n.Node               ] version[5.4.2], pid[7153], build[929b078/2017-06-15T02:29:28.122Z], OS[Linux/4.4.0-83-generic/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_112/25.112-b15]
[2017-07-11T17:21:43,975][INFO ][o.e.n.Node               ] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+DisableExplicitGC, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.path.home=/home/scott/Downloads/netty-6813]
[2017-07-11T17:21:44,442][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to load plugin - Plugin information:
Name: search-guard-5
Description: Provide access control related features for Elasticsearch 5
Version: 5.4.2-13
Native Controller: false
 * Classname: com.floragunn.searchguard.SearchGuardPlugin due to jar hell
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:127) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:114) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:67) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:122) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.cli.Command.main(Command.java:88) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) ~[elasticsearch-5.4.2.jar:5.4.2]
Caused by: java.lang.IllegalStateException: failed to load plugin - Plugin information:
Name: search-guard-5
Description: Provide access control related features for Elasticsearch 5
Version: 5.4.2-13
Native Controller: false
 * Classname: com.floragunn.searchguard.SearchGuardPlugin due to jar hell
	at org.elasticsearch.plugins.PluginsService.loadBundles(PluginsService.java:372) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:139) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.node.Node.<init>(Node.java:310) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.node.Node.<init>(Node.java:242) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:232) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:232) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:350) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) ~[elasticsearch-5.4.2.jar:5.4.2]
	... 6 more
Caused by: java.util.zip.ZipException: error in opening zip file
	at java.util.zip.ZipFile.open(Native Method) ~[?:1.8.0_112]
	at java.util.zip.ZipFile.<init>(ZipFile.java:219) ~[?:1.8.0_112]
	at java.util.zip.ZipFile.<init>(ZipFile.java:149) ~[?:1.8.0_112]
	at java.util.jar.JarFile.<init>(JarFile.java:166) ~[?:1.8.0_112]
	at java.util.jar.JarFile.<init>(JarFile.java:103) ~[?:1.8.0_112]
	at org.elasticsearch.bootstrap.JarHell.checkJarHell(JarHell.java:180) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.plugins.PluginsService.loadBundles(PluginsService.java:370) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:139) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.node.Node.<init>(Node.java:310) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.node.Node.<init>(Node.java:242) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:232) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:232) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:350) ~[elasticsearch-5.4.2.jar:5.4.2]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:123) ~[elasticsearch-5.4.2.jar:5.4.2]
	... 6 more

@floragunn
Contributor Author

@Scottmitch just remove the OSX tcnative jar (you must not have two tcnative jars on the classpath).

@Scottmitch
Member

Scottmitch commented Jul 13, 2017

I tried this and it doesn't work. I've also tried removing the tcnative jar completely and got the same error.

Ubuntu 16.10
4.4.0-83-generic #106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

@floragunn
Contributor Author

Sorry, I will double-check and provide an out-of-the-box working example for Ubuntu.

@floragunn
Contributor Author

@Scottmitch here is one which should work OOTB on Ubuntu (as long as openssl and libapr1 are installed):
https://bintray.com/floragunncom/files/download_file?file_path=n3%2Fnetty-6813-ubuntu3.tar.gz

I also have this running on an AWS EC2 instance, and I'm happy to mail the SSH key to you if you would like to test it there.

@Scottmitch
Member

I ran some diagnostics with jemalloc/jeprof and didn't see any smoking gun at the malloc level. It looks like most of the memory sits with malloc; updatewindow, init, and readCEN retain some memory too, but that memory doesn't appear to change as -XX:MaxDirectMemorySize is increased. As you said in #6813 (comment), you are disabling Unsafe, so all the direct memory should be managed by the GC (even if there are ByteBuf leaks, the memory should be reclaimed ... eventually). I need to look at the heap dumps in more detail to see where memory is being retained...

@rkapsi
Member

rkapsi commented Jul 17, 2017

I don't know if it's relevant/useful, but we happen to use -XX:+UseConcMarkSweepGC as well, and in my experience it's very easy to make it OOME with direct memory. It's sometimes not compelled to collect.

I just ran into it with an unrelated project (HTTPS + netty-tcnative) where I thought I could get away with not calling #release().

An exception 'io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 64 byte(s) of direct memory (used: 2079719423, max: 2079719424)' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 64 byte(s) of direct memory (used: 2079719423, max: 2079719424)
    at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:615) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:569) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledByteBufAllocator.java:169) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledUnsafeDirectByteBuf.<init>(UnpooledUnsafeDirectByteBuf.java:68) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledUnsafeNoCleanerDirectByteBuf.java:25) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf.<init>(UnpooledByteBufAllocator.java:164) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(UnpooledByteBufAllocator.java:73) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:181) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:172) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:80) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:122) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) ~[netty-all-4.1.13.Final.jar:4.1.13.Final]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_102]
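
For reference, the usual fix for that kind of OOME is to make sure the last handler that consumes a ByteBuf releases it; a minimal sketch using SimpleChannelInboundHandler, which releases automatically:

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;

// SimpleChannelInboundHandler releases the message automatically after
// channelRead0 returns, so the direct buffer is returned to the allocator.
public class ConsumingHandler extends SimpleChannelInboundHandler<ByteBuf> {
    @Override
    protected void channelRead0(ChannelHandlerContext ctx, ByteBuf msg) {
        // process msg here; do NOT retain it unless you take ownership
        System.out.println("read " + msg.readableBytes() + " bytes");
    }
}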

@Scottmitch
Member

A reason why you may not experience the same behavior when using the JDK SSL provider is that the JDK SSL engine doesn't require a copy at the application level to wrap heap buffers [1].

[1] https://github.com/netty/netty/blob/4.1/handler/src/main/java/io/netty/handler/ssl/SslHandler.java#L922

@Scottmitch
Member

As stated in #6813 (comment) and https://bugs.openjdk.java.net/browse/JDK-8142537, I don't think you can rely upon the GC to collect direct memory if -XX:+DisableExplicitGC is enabled. Relying upon ByteBuffer.allocateDirect's memory enforcement, which depends upon explicit GC hints, is defeated by disabling those hints. Unless the JDK is updated to ignore -XX:+DisableExplicitGC for direct memory allocation, I think OOME is expected behavior from the JDK.

I have not been able to detect any leaks in your reproducer and everything seems to be working "as expected". If you are not able to control the command line arguments, consider using the JDK SSL provider ... however, if there is anywhere else that may use direct memory to copy data for JNI reasons (e.g. transport-native-epoll, transport-native-kqueue, etc...), you will likely run into the same issue. I will close this for now because it appears to be more related to configuration and "expected" JDK behavior (pending https://bugs.openjdk.java.net/browse/JDK-8142537).
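
For completeness, a minimal sketch of forcing the JDK provider when building the SslContext; the certificate and key paths are placeholders:

import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;
import java.io.File;

public final class JdkSslContextFactory {
    public static SslContext create() throws Exception {
        // Explicitly pin the JDK engine instead of tcnative/OpenSSL.
        return SslContextBuilder
                .forServer(new File("cert.pem"), new File("key.pem")) // placeholder paths
                .sslProvider(SslProvider.JDK)
                .build();
    }
}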

@brandond

@Scottmitch @floragunncom I'm a bit disappointed this got closed. I am unable to use the native bindings on our Elasticsearch cluster with Search Guard SSL because Java gets OOM-killed within a few minutes of startup: a JRE that normally consumes 27 GB of 32 GB of RAM quickly runs up to about 36 GB (all available swap) and is then killed. We're taking a pretty big perf hit without it, but if the choice is crash or run slowly, I guess we'll just run slowly.

@floragunn
Contributor Author

With ES 5.5.1 (and later) this issue should be resolved, because elastic/elasticsearch#25759 removes -XX:+DisableExplicitGC from the ES default config (commit details). If you are on an earlier version, just remove -XX:+DisableExplicitGC manually from jvm.options.

@brandond

brandond commented Jul 31, 2017

@floragunncom Even without DisableExplicitGC it still seems to behave poorly. Running ES 5.5.1, it got OOM-killed while assigning shards, before the cluster even went green (I have about 5 TB of data in ~8k shards). I added MaxDirectMemorySize=512m to try to limit it, but it still seems to leak memory and eventually gets stuck in GC hell. Note the VM size:

top - 21:49:58 up 2 days, 13:12,  1 user,  load average: 2.69, 4.16, 2.77
Tasks: 109 total,   1 running, 108 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.2%us,  0.0%sy,  0.0%ni, 74.5%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  31401100k total, 31167608k used,   233492k free,   212284k buffers
Swap:        0k total,        0k used,        0k free,  7102328k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3961 elastics  20   0  986g  26g 5.3g S 99.9 89.2  10:54.72 java

I have tried the dynamically linked openssl from netty, your static openssl hosted on bintray, and the static boringssl from netty. Similar results.

@floragunn
Contributor Author

@brandond Which tcnative versions?

@floragunn
Contributor Author

2.0.0, 2.0.1 and 2.0.2 have known memory leaks; 2.0.3 should work; 2.0.4 is a broken release; 2.0.5 should also work. boringssl may work, but we do not test it.

So we recommend using our 2.0.5 static openssl build together with ES 5.5.1 and Search Guard 14. I am currently trying to reproduce this on AWS with esrally, but no problems so far.

@brandond

brandond commented Aug 1, 2017

Do you have a static OpenSSL build of a known stable release? The only static build I could find on Maven was the BoringSSL build. I'm referencing this document, which it sounds like should be updated to mention that massive memory leaks are to be expected unless you have the right versions of Elasticsearch and netty-tcnative.

https://github.com/floragunncom/search-guard-docs/blob/master/tls_openssl.md

Edit: Found your 2.0.5 static build on bintray. Trying it now. Anything I should collect to help diagnose the issue?

@floragunn
Contributor Author

We will update the docs soon. See http://dl.bintray.com/floragunncom/netty-tcnative/

@brandond

brandond commented Aug 1, 2017

Result:

[304057.604728] Out of memory: Kill process 9162 (java) score 971 or sacrifice child
[304057.608374] Killed process 9162 (java) total-vm:35956548kB, anon-rss:30360260kB, file-rss:124132kB, shmem-rss:0kB
[304058.094253] oom_reaper: reaped process 9162 (java), now anon-rss:21584260kB, file-rss:124536kB, shmem-rss:0kB

jvm.options for es:

-Xms20442m
-Xmx20442m
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+AlwaysPreTouch
-server
-Xss1m
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Djdk.io.permissionsUseCanonicalPath=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError

@floragunn
Contributor Author

floragunn commented Aug 1, 2017

With ES 5.5.1?
What is your Java version and vendor?
Do you make big bulk requests? (I just read that it happens even before the cluster is green.)

And please remove your extra MaxDirectMemorySize setting.

@brandond

brandond commented Aug 1, 2017

Here's the startup log, with ES and Java versions.

[2017-08-01T21:04:03,826][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] initializing ...
[2017-08-01T21:04:03,908][INFO ][o.e.e.NodeEnvironment    ] [i-04b5958c79e89bd20] using [1] data paths, mounts [[/mnt/elasticsearch (/dev/xvdb)]], net us
able_space [947.7gb], net total_space [1.9tb], spins? [no], types [ext4]
[2017-08-01T21:04:03,908][INFO ][o.e.e.NodeEnvironment    ] [i-04b5958c79e89bd20] heap size [19.9gb], compressed ordinary object pointers [true]
[2017-08-01T21:04:04,901][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] node name [i-04b5958c79e89bd20], node ID [00QQBdGZSXeAdf8xsH9R5g]
[2017-08-01T21:04:04,902][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] version[5.5.1], pid[9162], build[19c13d0/2017-07-18T20:44:24.823Z], OS[Linux/4.9.38-16.33.amzn1.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_141/25.141-b16]
[2017-08-01T21:04:04,902][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] JVM arguments [-Xms20442m, -Xmx20442m, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.path.home=/usr/share/elasticsearch]
[2017-08-01T21:04:05,965][INFO ][c.f.s.s.SearchGuardSSLPlugin] Consider setting -Djdk.tls.rejectClientInitiatedRenegotiation=true to prevent DoS attacks through client side initiated TLS renegotiation.
[2017-08-01T21:04:06,321][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] Open SSL OpenSSL 1.0.2l  25 May 2017 (268443855) available
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.version: 1.8.0_141
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.specification.version: 1.8
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.specification.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.specification.name: Java Virtual Machine Specification
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.name: OpenJDK 64-Bit Server VM
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.specification.version: 1.8
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.specification.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.specification.name: Java Platform API Specification
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] os.name: Linux
[2017-08-01T21:04:06,323][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] os.arch: amd64
[2017-08-01T21:04:06,323][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] os.version: 4.9.38-16.33.amzn1.x86_64

With netty-tcnative installed, I've never been able to get the cluster to stay up long enough to actually start making any queries. It OOMs while assigning shards.

@brandond

brandond commented Aug 1, 2017

Just for the heck of it, I tried adding -XX:MaxDirectMemorySize=512m to jvm.options. If I do this, it does not immediately get OOM-killed. It does, however, chew through a ton of VM and eventually ends up stalled in GC hell before I have to shut it down manually.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12254 elastics  20   0  544g  25g 4.5g S 61.9 86.7   6:21.11 java

@floragunn
Contributor Author

OK, so it basically works with Java SSL?
Have you set "bootstrap.memory_lock: true" in elasticsearch.yml?
Have you disabled hostname verification?

I am currently testing on AWS with the following parameters and it works without any hassle:

  • 3 x r3.2xlarge, 30 GB heap, Ubuntu 16.04 (Kernel 4.4.0)
  • ES 5.5.1, SG 14, Oracle JVM 1.8.0_131/25.131-b11, memory_lock true
  • max file desc = 1000000
  • Instance SSD storage
  • OpenSSL 1.0.2g, tcnative 2.0.5 dynamic compiled version from netty maven repo
  • No hostname verification at all enabled
  • HTTPS enabled
  • 120 million docs, 116 shards, 20 GB total size

@floragunn
Contributor Author

@Scottmitch @normanmaurer any ideas?

@brandond

brandond commented Aug 1, 2017

memory lock and hostname verification are both off. You're running with a 30 GB heap on 61 GB of RAM; I'm running with 20 of 31. I'll try dropping it down to 15 GB with no MaxDirectMemorySize and memory_lock on and see if it makes any difference.

Edit: memory_lock was on. Trying again with smaller heap size.

@brandond

brandond commented Aug 1, 2017

15GB heap and no MaxDirectMemorySize: OOM killed.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15289 elastics  20   0 34.0g  28g 126m S 145.7 95.3   3:21.36 /usr/bin/java -Xms15332m -Xmx15332m

@brandond

brandond commented Aug 1, 2017

10GB heap: no OOM kill. Seems to work?

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16678 elastics  20   0  617g  19g 9.1g S 84.2 66.6   5:32.69 /usr/bin/java -Xms10221m -Xmx10221m

For the record, here's what it looks like without netty-tcnative and a 20 GB heap:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18327 elastics  20   0 1065g  26g 5.6g S 17.3 88.4   9:50.35 /usr/bin/java -Xms20442m -Xmx20442m 

Appears to essentially double the memory utilization?

@Scottmitch
Member

@floragunncom - no ideas and limited cycles at the moment.

@brandond - can you provide a reproducer similar to #6813 (comment)?

@jansimak

jansimak commented Dec 8, 2017

Hi.
I can confirm it's still happening. During heavy bulk indexing or searching, the heap goes up and never drops back.
Environment: heap size 30GB, jvm without -XX:+DisableExplicitGC, openjdk-8-jre-headless=8u141-b15-1~deb9u1, debian stretch, elasticsearch=5.6.0, netty-tcnative-openssl-1.0.2m-static-2.0.7

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
6147 elastic+  20   0  0.137t 0.078t 146116 S  22.2 84.1  26:09.21 java

Is there anything I can do to help solve the problem? The issue forces us to use Java SSL, but openssl is preferable.

Thanks,
Honza

@fbasar

fbasar commented May 6, 2019

io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.OutOfMemoryError: Direct buffer memory
at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:769)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:745)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:332)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176)
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:137)
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:147)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)

It gives this error every day.

Java 11.0.2
Netty 4.1.35.Final
TCNative 2.0.25.Final

-Xms2048m -Xmx2048m -server -verbosegc -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=1024m -Dio.netty.tryReflectionSetAccessible=false -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -server

@mrkk

mrkk commented May 29, 2019

Why was the issue closed?

@brandond

Because it's hard to reproduce under controlled circumstances, and easy to work around by just not using the tcnative openssl bindings.
