Description
Hi,
I have recently been doing some Docker network research, but I cannot figure out why the Docker overlay network performance is so poor. I built two containers running on two VMs and used Docker's VXLAN-based overlay network for the multi-host connection. For the bandwidth measurement, I used iperf3.
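For reference, a test like this can be reproduced with an attachable overlay network and two iperf3 containers; the network name, image, and options below are placeholders rather than my exact commands:

# on a swarm manager: create an attachable overlay network (placeholder name)
docker network create -d overlay --attachable perf-net
# on VM1: start the iperf3 server container (networkstatic/iperf3 is one commonly used image)
docker run --rm --network perf-net --name iperf-server networkstatic/iperf3 -s
# on VM2: run the iperf3 client against the server container by name
docker run --rm --network perf-net networkstatic/iperf3 -c iperf-server -t 30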
For the VM-to-VM test, the throughput reaches 20+ Gbits/sec:
[ 5] 0.00-1.00 sec 1.93 GBytes 16.6 Gbits/sec
[ 5] 1.00-2.00 sec 2.91 GBytes 25.0 Gbits/sec
[ 5] 2.00-3.00 sec 2.07 GBytes 21.8 Gbits/sec
However, the container-to-container throughput across hosts is only 2-3 Gbits/sec, and there are a lot of retransmissions (Retr):
[ 4] 0.00-1.00 sec 291 MBytes 2.44 Gbits/sec 218 714 KBytes
[ 4] 1.00-2.00 sec 414 MBytes 3.47 Gbits/sec 663 942 KBytes
[ 4] 2.00-3.00 sec 384 MBytes 3.22 Gbits/sec 1182 846 KBytes
I also tested container-to-container on the same VM; the throughput was again 20+ Gbits/sec, which suggests docker0 is not the bottleneck for the multi-host connection:
[ 4] 0.00-1.00 sec 2.63 GBytes 22.6 Gbits/sec 328 657 KBytes
[ 4] 1.00-2.00 sec 3.14 GBytes 26.9 Gbits/sec 0 856 KBytes
[ 4] 2.00-3.00 sec 3.86 GBytes 33.2 Gbits/sec 0 856 KBytes
I also measured the CPU utilization on both the client and the server:
Neither the client nor the server CPU is fully used, even though VXLAN consumes extra CPU for encapsulation and decapsulation.
If the throughput cannot go any higher, some resource must be the limit. Could anyone give any hints as to why the throughput of the Docker overlay network is so poor?
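One way to narrow this down is to watch per-core utilization while the iperf3 test runs; if a single core sits near 100% in softirq while the others are idle, the VXLAN receive path is the limit even though overall CPU looks low. A minimal check (mpstat is part of the sysstat package):

# per-core CPU utilization, refreshed every second, during the iperf3 run
mpstat -P ALL 1
# softirq activity per core (VXLAN encapsulation/decapsulation shows up as NET_RX/NET_TX)
cat /proc/softirqs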
Activity
zq-david-wang commented on Sep 19, 2018
The CPU usage signature on the server side indicates that the VXLAN UDP traffic can only be processed by a single core. I guess you are running a fairly old kernel.
Maybe you could try upgrading the kernel, or figure out how to balance the CPU usage across cores.
(I had a similar issue with CentOS 7.0, kernel 3.10.x, and a 10 Gbit/s NIC: VXLAN bandwidth could only reach about 2 Gbit/s. After upgrading the kernel to 4.x, the bandwidth improved significantly.)
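For anyone checking this on their own hosts, a rough way to confirm the kernel version and whether receive processing can be spread across cores (eth0 is a placeholder for the host's primary interface):

# kernel version
uname -r
# number of hardware RX/TX queues the NIC exposes, if the driver supports it
ethtool -l eth0
# RPS CPU mask for the first RX queue; all zeros means receive packet steering is off
cat /sys/class/net/eth0/queues/rx-0/rps_cpus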
kevinsuo commented on Sep 19, 2018
@zq-david-wang thanks for your reply.
I am already on a fairly new kernel, 4.9. However, compared to the network without the container overlay, the VXLAN container network is still very poor, as shown above.
HosseinAgha commented on Nov 25, 2019
@kevinsuo I'm experiencing the same issue, except that I see about a 99% drop in throughput and a huge number of TCP retransmissions.
I'm using the latest Docker Community Edition 19.03.5 on Ubuntu 18.04.3 with Linux kernel 4.15.0-70-generic.
Here is the result running iperf between the hosts:
When running iperf3 between 2 swarm services connected through an overlay network (on the same 2 hosts):
We found this issue when we encountered very slow performance in record propagation between our database replicas.
I've already checked #35082 and #33133, but I don't think they apply here, as I'm not using an encrypted overlay network and iperf is not making a lot of parallel requests. #30768 may also be related.
I think this is a major performance issue.
HosseinAgha commented on Nov 25, 2019
I performed the same test on a similar setup, Docker Community Edition 19.03.5 on Ubuntu 18.04.3 with Linux kernel 4.15.0-1054-aws, on an AWS instance.
Here is the result running iperf between the 2 hosts:
When running iperf3 between 2 swarm services connected through an overlay network (on the same 2 hosts):
I don't see as much of a drop in throughput, but the retransmission rate is still very high.
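For what it's worth, the retransmissions iperf3 reports can be cross-checked on the host while the test runs, using standard Linux TCP counters:

# system-wide TCP retransmission counter (nstat prints the delta since its last run)
nstat TcpRetransSegs
# per-connection retransmission and congestion-window details for established sockets
ss -ti state established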
@thaJeztah I think there is a major issue in the latest swarm overlay networking.
thaJeztah commented on Nov 25, 2019
In your situation, did this problem not occur with older versions of Docker in the same setup?
HosseinAgha commented on Nov 25, 2019
No, I don't think so. We used to use docker swarm for our production servers 2 years ago and we did not have any issues.
I think there may be something wrong with the network/instance configuration of our current cloud provider (which uses OpenStack), as the problem is less severe on AWS.
But the issue remains in any setting: using swarm's overlay network, we see a drop in bandwidth plus a very high TCP packet retransmission rate.
thaJeztah commented on Nov 25, 2019
@arkodg any ideas?
arkodg commented on Dec 4, 2019
The default overlay network created by Docker has an MTU of 1500, which might limit bandwidth if the host's outgoing interface supports a higher MTU. Increasing the MTU of the overlay network is one knob that can be used to improve/tune network bandwidth performance.
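As a concrete sketch, the overlay driver accepts the generic MTU driver option, so something like the following should work (the network name and MTU value are examples; the MTU has to stay roughly 50 bytes below the underlay MTU to leave room for the VXLAN headers):

# create an overlay network with a larger MTU (8950 fits under a 9001-MTU underlay)
docker network create -d overlay --attachable \
  --opt com.docker.network.driver.mtu=8950 \
  jumbo-overlay
# verify the MTU seen inside a container attached to that network
docker run --rm --network jumbo-overlay alpine ip link show eth0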
I have a Swarm cluster with 2 nodes.
Node 1: host primary interface has an MTU of 9001; created 3 iperf servers.
Node 2: ran 3 iperf client containers, one for each type of network.
HosseinAgha commented on Dec 4, 2019
Thank you @arkodg for the extensive test. I think it would be awesome if you mentioned the need to tune the Docker overlay network MTU in the documentation at https://docs.docker.com/network/overlay
I was completely clueless until now.