java.lang.OutOfMemoryError: Direct buffer memory (with tcnative) #6813

Description

Expected behavior

No java.lang.OutOfMemoryError

Actual behavior

[2017-06-02T11:00:00,021][WARN ][c.f.s.h.SearchGuardHttpServerTransport] [GtfsfZB] caught exception while handling client http traffic, closing connection [id: 0x23038e55, L:/172.16.0.1:9200 - R:/172.16.0.1:40464]
java.lang.OutOfMemoryError: Direct buffer memory
	at java.nio.Bits.reserveMemory(Bits.java:693) ~[?:1.8.0_131]
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[?:1.8.0_131]
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[?:1.8.0_131]
	at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:764) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:740) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:226) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:146) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:324) ~[netty-buffer-4.1.11.Final.jar:4.1.11.Final]

Steps to reproduce

Minimal yet complete reproducer code (or URL to code)

Netty version

4.1.11 with tcnative 2.0.1.Final

JVM version (e.g. java -version)

1.8.0_131

OS version (e.g. uname -a)

Ubuntu

floragunn (Author) commented on Jun 4, 2017

Maybe related to #6789.

rkapsi (Member) commented on Jun 4, 2017

@floragunncom does your Netty app create and destroy a lot of SslContext instances? The leak in #6789 is a bit of native memory that gets allocated when an SslContext is created and is not freed when the context gets garbage collected by the JVM.
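
For context, a minimal sketch of the pattern rkapsi is asking about (the class name is illustrative, not from this issue): build one SslContext up front and reuse it for every channel instead of creating a new context per connection.

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.util.InsecureTrustManagerFactory;

public class TlsClientInitializer extends ChannelInitializer<SocketChannel> {

    // Built once and shared; repeatedly building and dropping contexts is what leaked native memory in #6789.
    private final SslContext sslCtx;

    public TlsClientInitializer() throws Exception {
        this.sslCtx = SslContextBuilder.forClient()
                .trustManager(InsecureTrustManagerFactory.INSTANCE) // demo only, performs no real trust checks
                .build();
    }

    @Override
    protected void initChannel(SocketChannel ch) {
        // A new SslHandler per channel is fine; the SslContext itself is reused.
        ch.pipeline().addLast(sslCtx.newHandler(ch.alloc()));
    }
}
```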

floragunn (Author) commented on Jun 4, 2017

No, it's not creating a lot of SslContext instances, so this seems unrelated to #6789.

Scottmitch (Member) commented on Jun 5, 2017

  • Can you provide a minimal/complete reproducer?
  • Can you use ByteBufAllocatorMetric to describe the consumption of memory over time (e.g. every X seconds)? A polling sketch follows this list.
  • Do you set io.netty.maxDirectMemory or -XX:MaxDirectMemorySize, and if so, to what value(s)?
  • Did your application change at all?
  • Do you observe this behavior with older versions of Netty/tcnative?
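
A rough polling sketch for the ByteBufAllocatorMetric question above, assuming a Netty 4.1 version whose PooledByteBufAllocatorMetric exposes usedHeapMemory()/usedDirectMemory(); the class name and interval are placeholders:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

public final class AllocatorMetricLogger {
    public static void start() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            PooledByteBufAllocatorMetric m = PooledByteBufAllocator.DEFAULT.metric();
            // Log pooled heap/direct usage so growth over time becomes visible.
            System.out.printf("pooled heap=%d bytes, direct=%d bytes, directArenas=%d%n",
                    m.usedHeapMemory(), m.usedDirectMemory(), m.numDirectArenas());
        }, 0, 10, TimeUnit.SECONDS);
    }
}
```
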
normanmaurer (Member) commented on Jun 6, 2017

Also, did you "stop" writing when the Channel became non-writable?
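
As a hedged sketch of that suggestion (handler and helper names are illustrative), writes can be gated on channel writability so outbound data does not pile up in direct memory:

```java
import io.netty.buffer.Unpooled;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.CharsetUtil;

public class BackpressureAwareHandler extends ChannelInboundHandlerAdapter {

    private void writeChunks(ChannelHandlerContext ctx) {
        // Only write while the outbound buffer is below the high water mark;
        // otherwise pending writes accumulate off-heap.
        while (ctx.channel().isWritable() && hasMoreData()) {
            ctx.write(Unpooled.copiedBuffer(nextChunk(), CharsetUtil.UTF_8));
        }
        ctx.flush();
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
        if (ctx.channel().isWritable()) {
            writeChunks(ctx); // resume once the peer has drained enough
        }
        super.channelWritabilityChanged(ctx);
    }

    // Placeholders for the application's own data source.
    private boolean hasMoreData() { return false; }
    private String nextChunk() { return ""; }
}
```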

floragunn (Author) commented on Jun 26, 2017

The error is still there with 4.1.12 and tcnative 2.0.3 - I will now start investigating this a bit more deeply. It is definitely related to the amount of data that is transferred: for small datasets it works well, but as the amount of data increases it fails.

floragunn (Author) commented on Jun 28, 2017

floragunn (Author) commented on Jul 2, 2017

After removing the -XX:+DisableExplicitGC flag from the command line, the java.lang.OutOfMemoryError: Direct buffer memory seems to be gone.

Unfortunately this is not really an option for us, because in production we have no control over the JVM flags and so have to live with -XX:+DisableExplicitGC.

io.netty.maxDirectMemory is not set explicitly, and -XX:MaxDirectMemorySize is the same as -Xmx (in our case normally somewhere between 4 and 64 GB).

It looks like the problem was introduced in Netty 4.1.8 or 4.1.9, because 4.1.7 was reported stable. Unfortunately I am not able to create a minimal reproducer, but I will assemble something that demonstrates the problem.

Running Netty without OpenSSL (using Java SSL) works well for all versions and circumstances.
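
A tiny diagnostic sketch for checking which direct-memory limit is actually in effect; it uses Netty's internal PlatformDependent helper, so treat it as a debugging aid rather than a stable API:

```java
import io.netty.util.internal.PlatformDependent;

public class DirectLimitProbe {
    public static void main(String[] args) {
        // The JVM's effective direct-memory limit (from -XX:MaxDirectMemorySize, or derived from the heap size),
        // plus whether Netty can take its Unsafe-based fast path at all.
        System.out.println("maxDirectMemory       = " + PlatformDependent.maxDirectMemory() + " bytes");
        System.out.println("hasUnsafe             = " + PlatformDependent.hasUnsafe());
        System.out.println("directBufferPreferred = " + PlatformDependent.directBufferPreferred());
    }
}
```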

floragunn (Author) commented on Jul 2, 2017

Please download https://bintray.com/floragunncom/files/download_file?file_path=netty%2Fnetty-6813-1.tar.gz and extract it.

If you are on macOS, just run bin/elasticsearch, wait a few seconds, then run ./load_sampledata.sh in another terminal until you see a java.lang.OutOfMemoryError: Direct buffer memory in the first console (normally after one or two runs).

If you are on Linux, look in plugin/search-guard-5 and replace the tcnative jar with the one for Linux.

Without the tcnative jar we fall back to Java SSL and everything runs well. Remove -XX:+DisableExplicitGC from config/jvm.options and the OutOfMemoryError should not show up. If you increase -Xmx or -XX:MaxDirectMemorySize, run ./load_sampledata.sh a few more times and you will see the error and the JVM dies.

The issue was originally reported here: https://github.com/floragunncom/search-guard/issues/343

floragunn (Author) commented on Jul 10, 2017

floragunn (Author) commented on Jul 10, 2017

It looks like java/nio/Bits.java itself calls System.gc() (see the OpenJDK source), which with -XX:+DisableExplicitGC present does nothing, so it appears that the direct buffers do not get garbage collected fast enough.

The curious thing is that I never hit this before tcnative 2.0.0. Unfortunately, just removing -XX:+DisableExplicitGC is not an option, and I am running out of ideas.

Relates to JDK-8142537.
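
A standalone sketch of the mechanism described above (the sizes and the 64m limit are arbitrary): with -XX:+DisableExplicitGC the System.gc() hint issued by Bits.reserveMemory is a no-op, so whether dropped buffers are reclaimed in time is left entirely to regular GC activity, and the run may end in java.lang.OutOfMemoryError: Direct buffer memory.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectBufferPressure {
    // Run e.g. with: java -XX:MaxDirectMemorySize=64m -XX:+DisableExplicitGC DirectBufferPressure
    public static void main(String[] args) {
        List<ByteBuffer> batch = new ArrayList<>();
        while (true) {
            // Each allocation goes through Bits.reserveMemory(); once the limit is reached,
            // the JDK calls System.gc() as a hint, which -XX:+DisableExplicitGC turns into a no-op.
            batch.add(ByteBuffer.allocateDirect(1024 * 1024));
            if (batch.size() == 32) {
                batch.clear(); // drop references; reclamation now depends on the GC running "soon enough"
            }
        }
    }
}
```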

normanmaurer (Member) commented on Jul 11, 2017

floragunn (Author) commented on Jul 11, 2017

The leak detector does not report any leaks (I also tried paranoid mode). I tried unpooled and pooled allocators as well; no difference.

Scottmitch (Member) commented on Jul 11, 2017

The OpenJDK invokes System.gc() during direct ByteBuffer allocation to provide a hint and hope for timely reclamation of direct memory by the GC (because these allocations are off heap). Using a properly tuned pooled allocator should reduce/remove the need to allocate via ByteBuffer, and by default we even try to bypass the ByteBuffer allocation path for performance reasons (controlled via io.netty.maxDirectMemory [1]). It sounds like this occurs during a burst of activity ... are you tuning the pooled allocator bucket sizes for your application, and are you sure you allow enough memory to be allocated for the increased traffic load (how did you calculate the max)?

The curious thing is that I never hit this before tcnative 2.0.0

Note that there were some memory leaks in netty-tcnative 2.0.0 and 2.0.1. However, the known leaks have been fixed in 2.0.3.

[1] https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L149
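
For the bucket-sizing question above, a hedged sketch of handing a bootstrap an explicitly sized pooled allocator (the arena/page numbers are placeholders, not recommendations):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.channel.ChannelOption;

public final class TunedAllocator {
    public static ServerBootstrap apply(ServerBootstrap bootstrap) {
        // chunkSize = pageSize << maxOrder = 8192 << 9 = 4 MiB per chunk, two arenas of each kind.
        PooledByteBufAllocator alloc = new PooledByteBufAllocator(
                true,  // preferDirect
                2,     // nHeapArena
                2,     // nDirectArena
                8192,  // pageSize
                9);    // maxOrder
        return bootstrap
                .option(ChannelOption.ALLOCATOR, alloc)
                .childOption(ChannelOption.ALLOCATOR, alloc);
    }
}
```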

(19 hidden items not shown)

brandond commented on Aug 1, 2017

Result:

[304057.604728] Out of memory: Kill process 9162 (java) score 971 or sacrifice child
[304057.608374] Killed process 9162 (java) total-vm:35956548kB, anon-rss:30360260kB, file-rss:124132kB, shmem-rss:0kB
[304058.094253] oom_reaper: reaped process 9162 (java), now anon-rss:21584260kB, file-rss:124536kB, shmem-rss:0kB

jvm.options for es:

-Xms20442m
-Xmx20442m
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+AlwaysPreTouch
-server
-Xss1m
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Djdk.io.permissionsUseCanonicalPath=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError

floragunn (Author) commented on Aug 1, 2017

With ES 5.5.1?
What is your Java version and vendor?
Do you make big bulk requests? (I just read that it happens even before the cluster is green.)

And please remove your extra MaxDirectMemorySize settings.

brandond commented on Aug 1, 2017

Here's the startup log, with ES and Java versions.

[2017-08-01T21:04:03,826][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] initializing ...
[2017-08-01T21:04:03,908][INFO ][o.e.e.NodeEnvironment    ] [i-04b5958c79e89bd20] using [1] data paths, mounts [[/mnt/elasticsearch (/dev/xvdb)]], net usable_space [947.7gb], net total_space [1.9tb], spins? [no], types [ext4]
[2017-08-01T21:04:03,908][INFO ][o.e.e.NodeEnvironment    ] [i-04b5958c79e89bd20] heap size [19.9gb], compressed ordinary object pointers [true]
[2017-08-01T21:04:04,901][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] node name [i-04b5958c79e89bd20], node ID [00QQBdGZSXeAdf8xsH9R5g]
[2017-08-01T21:04:04,902][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] version[5.5.1], pid[9162], build[19c13d0/2017-07-18T20:44:24.823Z], OS[Linux/4.9.38-16.33.amzn1.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_141/25.141-b16]
[2017-08-01T21:04:04,902][INFO ][o.e.n.Node               ] [i-04b5958c79e89bd20] JVM arguments [-Xms20442m, -Xmx20442m, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -Djdk.io.permissionsUseCanonicalPath=true, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j.skipJansi=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.path.home=/usr/share/elasticsearch]
[2017-08-01T21:04:05,965][INFO ][c.f.s.s.SearchGuardSSLPlugin] Consider setting -Djdk.tls.rejectClientInitiatedRenegotiation=true to prevent DoS attacks through client side initiated TLS renegotiation.
[2017-08-01T21:04:06,321][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] Open SSL OpenSSL 1.0.2l  25 May 2017 (268443855) available
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.version: 1.8.0_141
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.specification.version: 1.8
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.specification.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.specification.name: Java Virtual Machine Specification
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.name: OpenJDK 64-Bit Server VM
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.vm.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.specification.version: 1.8
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.specification.vendor: Oracle Corporation
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] java.specification.name: Java Platform API Specification
[2017-08-01T21:04:06,322][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] os.name: Linux
[2017-08-01T21:04:06,323][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] os.arch: amd64
[2017-08-01T21:04:06,323][INFO ][c.f.s.s.DefaultSearchGuardKeyStore] os.version: 4.9.38-16.33.amzn1.x86_64

With netty-tcnative installed, I've never been able to get the cluster to stay up long enough to actually start making any queries. It OOMs while assigning shards.

brandond commented on Aug 1, 2017

Just for the heck of it, I tried adding -XX:MaxDirectMemorySize=512m to jvm.options. If I do this, it does not immediately get OOM killed. It does, however, chew through a ton of virtual memory and eventually ends up stalled in GC hell before I have to shut it down manually.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12254 elastics  20   0  544g  25g 4.5g S 61.9 86.7   6:21.11 java

floragunn (Author) commented on Aug 1, 2017

OK, so with Java SSL it basically works?
Have you set "bootstrap.memory_lock: true" in elasticsearch.yml?
Have you disabled hostname verification?

I am currently testing on AWS with the following parameters and it works without any hassle:

  • 3 x r3.2xlarge, 30 GB heap, Ubuntu 16.04 (Kernel 4.4.0)
  • ES 5.5.1, SG 14, Oracle JVM 1.8.0_131/25.131-b11, memory_lock true
  • max file desc = 1000000
  • Instance SSD storage
  • OpenSSL 1.0.2g, tcnative 2.0.5 dynamic compiled version from netty maven repo
  • No hostname verification at all enabled
  • HTTPS enabled
  • 120 million docs, 116 shards, 20 GB total size

floragunn (Author) commented on Aug 1, 2017

brandond commented on Aug 1, 2017

memory_lock and hostname verification are both off. You're running a 30 GB heap on 61 GB of RAM; I'm running 20 of 31. I'll try dropping it down to 15 GB with no MaxDirectMemorySize and memory_lock on and see if it makes any difference.

Edit: memory_lock was on. Trying again with a smaller heap size.

brandond commented on Aug 1, 2017

15GB heap and no MaxDirectMemorySize: OOM killed.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15289 elastics  20   0 34.0g  28g 126m S 145.7 95.3   3:21.36 /usr/bin/java -Xms15332m -Xmx15332m

brandond commented on Aug 1, 2017

10GB heap: no OOM kill. Seems to work?

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
16678 elastics  20   0  617g  19g 9.1g S 84.2 66.6   5:32.69 /usr/bin/java -Xms10221m -Xmx10221m

For the record, here's what it looks like without netty-tcnative and 20gb heap:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18327 elastics  20   0 1065g  26g 5.6g S 17.3 88.4   9:50.35 /usr/bin/java -Xms20442m -Xmx20442m 

Appears to essentially double the memory utilization?

Scottmitch (Member) commented on Aug 2, 2017

@floragunncom - no ideas and limited cycles at the moment.

@brandond - can you provide a reproducer similar to #6813 (comment)?

jansimak commented on Dec 8, 2017

Hi,
I can confirm it's still happening. During heavy bulk indexing or searching the heap goes up and never drops back.
Environment: heap size 30 GB, JVM without -XX:+DisableExplicitGC, openjdk-8-jre-headless=8u141-b15-1~deb9u1, Debian stretch, elasticsearch=5.6.0, netty-tcnative-openssl-1.0.2m-static-2.0.7

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
6147 elastic+  20   0  0.137t 0.078t 146116 S  22.2 84.1  26:09.21 java

Is there anything I can do to help solve the problem? The issue forces us to use Java SSL, but OpenSSL would be preferable.

Thanks,
Honza

fbasar commented on May 6, 2019

io.netty.channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.OutOfMemoryError: Direct buffer memory
at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:769)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:745)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:244)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:226)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:146)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:332)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:185)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:176)
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:137)
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:114)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:147)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:677)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:834)

It gives this error every day.

Java 11.0.2
Netty 4.1.35.Final
TCNative 2.0.25.Final

-Xms2048m -Xmx2048m -server -verbosegc -XX:+HeapDumpOnOutOfMemoryError -XX:MaxDirectMemorySize=1024m -Dio.netty.tryReflectionSetAccessible=false -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -server

mrkk commented on May 29, 2019

Why was the issue closed?

brandond commented on May 29, 2019

Because it's hard to reproduce under controlled circumstances and easy to work around by just not using the tcnative openssl bindings.
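
As a hedged illustration of that workaround (the file names are assumptions), an application that builds its own SslContext can pin the JDK provider so the tcnative/OpenSSL path is never used:

```java
import java.io.File;

import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;

public final class JdkSslContextFactory {
    // Hypothetical certificate/key locations; adjust for the real deployment.
    public static SslContext build() throws Exception {
        return SslContextBuilder
                .forServer(new File("node-cert.pem"), new File("node-key.pem"))
                .sslProvider(SslProvider.JDK) // force Java SSL even if netty-tcnative is on the classpath
                .build();
    }
}
```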
