arp_cache: neighbor table overflow! #4533

Closed
felipejfc opened this issue Feb 27, 2018 · 19 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@felipejfc
Contributor

  1. What kops version are you running? The command kops version will display
    this information.
    1.8

  2. What Kubernetes version are you running? kubectl version will print the
    version if a cluster is running or provide the Kubernetes version specified as
    a kops flag.
    1.8.6

  3. What cloud provider are you using?
    aws

I'm seeing a lot of arp_cache neighbor table overflow messages in my production cluster. Reading this blog post about large clusters, https://blog.openai.com/scaling-kubernetes-to-2500-nodes/, they say the solution is to increase the maximum size of the ARP cache table. Can I configure these sysctl options:

net.ipv4.neigh.default.gc_thresh1
net.ipv4.neigh.default.gc_thresh2
net.ipv4.neigh.default.gc_thresh3

using kops?

thanks!
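
For reference, a quick way to check where a node currently stands before changing anything; the commands below assume a typical Linux node and are not from the original report:

# Current thresholds (common Linux defaults are 128 / 512 / 1024)
sysctl net.ipv4.neigh.default.gc_thresh1 \
       net.ipv4.neigh.default.gc_thresh2 \
       net.ipv4.neigh.default.gc_thresh3

# Rough current size of this node's ARP table
ip -4 neigh show | wc -l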

@chrislovecnm
Contributor

Using a hook, sure!

@chrislovecnm
Contributor

@felipejfc
Contributor Author

thanks @chrislovecnm, don't you think the default max ARP table size should be larger, though? My cluster isn't even that big, it has ~60 nodes and ~2000-3000 pods I guess... what would be the downside of permitting a larger ARP table?

@chrislovecnm
Contributor

I would check with sig node. Really a kernel question.

@justinsb
Member

So I think we want to keep gc_thresh1 at 0: kubernetes/kubernetes#23395

I don't see a lot of problems in raising gc_thresh1 and gc_thresh2, but I'm not sure whether we should do this across the board automatically. I think it probably depends on your networking mode, in that I think modes that tunnel traffic will only use a single ARP entry per node, whereas modes that don't tunnel will use an ARP entry per pod. But I'm not really sure. Which network mode are you using @felipejfc ?
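
To make the three knobs concrete, here is a hedged sketch of what they control, using the kinds of values that come up later in the thread (the numbers are examples, not recommendations):

# gc_thresh1: if the table has fewer entries than this, the GC never runs
#             (kubernetes/kubernetes#23395 argues for keeping it at 0)
# gc_thresh2: soft maximum; the table may exceed this only briefly before entries
#             become eligible for collection
# gc_thresh3: hard maximum; once hit, new entries force a synchronous GC and
#             packets can be dropped with "neighbor table overflow!"
sysctl -w net.ipv4.neigh.default.gc_thresh1=0
sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192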

@felipejfc
Contributor Author

@justinsb you must be right, I use Calico

@felipejfc
Contributor Author

for future reference if someone needs it, I've used the following hook:

  hooks:
  - manifest: |
      Type=oneshot
      ExecStart=/sbin/sysctl net.ipv4.neigh.default.gc_thresh3=8192 ; /sbin/sysctl net.ipv4.neigh.default.gc_thresh2=4096 ; /sbin/sysctl -p
    name: increase-neigh-gc-thresh.service
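
A possible alternative, not taken from the thread, is to persist the values in a sysctl drop-in so they also survive a sysctl reload or reboot; the path and values below are illustrative:

# Write the thresholds to a drop-in and apply everything under /etc/sysctl.d
cat <<'EOF' >/etc/sysctl.d/99-neigh-gc-thresh.conf
net.ipv4.neigh.default.gc_thresh2 = 4096
net.ipv4.neigh.default.gc_thresh3 = 8192
EOF
sysctl --system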

@felipejfc
Contributor Author

this was causing real damage to my cluster; I saw big performance improvements in several services after increasing gc_thresh3 and gc_thresh2

@chrislovecnm
Contributor

chrislovecnm commented Mar 2, 2018

@caseydavenport any comments?

@caseydavenport
Member

Seems sensible to me - using Calico, each node will have an ARP entry for each pod running on that node, so if you've got high pod density / pod churn, adjusting these makes sense.

I think it probably depends on your networking mode, in that I think modes that tunnel traffic will only use a single ARP entry per node,

I think it's probably bridged vs not bridged that makes the difference here. If all the pods are on a bridge the host will only need a single ARP entry, but for routed pods the host will need an ARP entry for each.
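
A hedged way to see the routed case on a Calico node; interface names and addresses below are made up:

# Each local pod shows up as a /32 host route on its own cali* interface and,
# once traffic flows, as a neighbour entry on that interface.
ip route | grep cali          # e.g. 100.96.3.17 dev cali1a2b3c4d5e6 scope link
ip neigh show | grep cali     # neighbour entries for local pod IPs
ip neigh show dev eth0        # neighbour entries for other hosts reachable on eth0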

@felipejfc
Contributor Author

@caseydavenport I guess every node will also have an ARP entry for each of the pods running on other nodes as well, right? At least the ones they communicate with?

@caseydavenport
Member

caseydavenport commented Mar 2, 2018

guess every node will also have an ARP entry for each of the pods running on other nodes as well, right? At least the ones they communicate with?

No, it shouldn't have one for every pod because the nodes themselves are the next hops for traffic, not individual pod IPs. Instead, you'll get an ARP entry for each node in the cluster. So, a given node's ARP cache should roughly be num_pods_on_that_node + num_nodes_in_cluster.
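
A rough, illustrative sanity check against that estimate (<node-name> is a placeholder, kubectl access is assumed, and --field-selector needs a reasonably recent kubectl):

# expected neighbour entries ≈ pods on this node + nodes in the cluster
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o name | wc -l
kubectl get nodes -o name | wc -l
ip -4 neigh show | wc -l
# e.g. ~50 local pods + ~60 nodes ≈ 110 entries, well under the default gc_thresh3 of 1024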

@felipejfc
Contributor Author

felipejfc commented Mar 2, 2018

@caseydavenport is it possible that Calico never cleans up ARP entries for nodes that were deleted from the cluster? I'm using AWS and this seems to be the case; take a look:

admin@ip-172-20-152-253:~$ sudo arp -an
? (172.20.136.122) at <incomplete> on eth0
? (172.20.151.251) at <incomplete> on eth0
? (172.20.157.194) at <incomplete> on eth0
? (172.20.135.175) at <incomplete> on eth0
? (172.20.149.88) at <incomplete> on eth0
? (172.20.133.12) at <incomplete> on eth0
? (172.20.150.190) at <incomplete> on eth0
? (172.20.142.212) at <incomplete> on eth0
? (172.20.147.156) at <incomplete> on eth0
? (172.20.135.149) at <incomplete> on eth0
? (172.20.142.193) at <incomplete> on eth0
? (172.20.140.166) at <incomplete> on eth0
? (172.20.159.68) at <incomplete> on eth0
? (172.20.154.197) at <incomplete> on eth0
? (172.20.143.104) at <incomplete> on eth0
? (172.20.158.14) at <incomplete> on eth0
? (172.20.134.69) at <incomplete> on eth0
? (172.20.145.91) at <incomplete> on eth0
? (172.20.148.118) at <incomplete> on eth0
? (172.20.132.26) at <incomplete> on eth0
? (172.20.138.215) at <incomplete> on eth0
? (172.20.129.120) at <incomplete> on eth0
? (172.20.147.224) at <incomplete> on eth0
? (172.20.133.77) at 0e:c4:88:17:6e:66 [ether] on eth0
? (172.20.157.124) at <incomplete> on eth0
? (172.20.140.234) at <incomplete> on eth0
? (172.20.147.221) at <incomplete> on eth0
? (172.20.158.82) at <incomplete> on eth0
? (172.20.135.202) at <incomplete> on eth0
? (172.20.138.61) at <incomplete> on eth0
? (172.20.145.178) at <incomplete> on eth0
? (172.20.148.73) at <incomplete> on eth0
? (172.20.158.79) at <incomplete> on eth0
? (172.20.150.229) at <incomplete> on eth0
? (172.20.137.197) at <incomplete> on eth0
? (172.20.141.14) at <incomplete> on eth0
? (172.20.148.186) at <incomplete> on eth0
? (172.20.159.246) at <incomplete> on eth0
? (172.20.154.119) at <incomplete> on eth0
? (172.20.146.141) at <incomplete> on eth0
? (172.20.153.2) at <incomplete> on eth0
? (172.20.133.145) at <incomplete> on eth0
? (172.20.151.121) at <incomplete> on eth0
? (172.20.154.96) at <incomplete> on eth0
? (172.20.128.22) at <incomplete> on eth0
? (172.20.138.20) at <incomplete> on eth0
? (172.20.135.45) at <incomplete> on eth0
? (172.20.157.64) at <incomplete> on eth0
? (172.20.130.162) at <incomplete> on eth0
? (172.20.152.193) at <incomplete> on eth0
? (172.20.134.247) at <incomplete> on eth0
? (172.20.131.213) at <incomplete> on eth0
? (172.20.150.60) at <incomplete> on eth0
...
...

I use cluster autoscaler and there are nodes being started and deleted all the time. This is the output from a machine that's only 4 hours old, and it has 1174 entries in its ARP table despite my cluster only having around 60 nodes... and there are these IPs that seem to belong to brokers that are no longer alive and stay in the incomplete state.
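
A hedged way to quantify the stale entries described above:

arp -an | grep -c '<incomplete>'                   # incomplete entries only
ip -4 neigh show | grep -cE 'INCOMPLETE|FAILED'    # same idea with iproute2 states
ip -4 neigh show | wc -l                           # total table size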

@caseydavenport
Member

@felipejfc while Calico isn't responsible for modifying the ARP table directly, I suspect this is a result of the same root cause as this issue: #3224

Basically Calico node configuration isn't getting cleaned up when nodes go away, so Calico will continue to try to access those nodes and thus will create a bunch of ARP entries which it can't complete (since the nodes are no longer there).

Adding in the node controller to kops should fix this as well.

felipejfc added a commit to felipejfc/kops that referenced this issue Mar 6, 2018
vendrov pushed a commit to vendrov/kops that referenced this issue Mar 21, 2018
rdrgmnzs pushed a commit to rdrgmnzs/kops that referenced this issue Apr 6, 2018
@alienth

alienth commented Apr 18, 2018

The kernel should be expiring stale ARP entries. It seems like the bug referenced in https://forums.aws.amazon.com/thread.jspa?messageID=572171 wasn't actually forwarded on to the kernel devs?

I think the ideal kernel behaviour would be to GC entries down to the minimum, but GC beyond that for stale entries.

@alienth

alienth commented Apr 18, 2018

Worth noting: If you're getting neighbor table overflow! log entries, this indicates that even after a synchronous GC of the ARP table, there was not enough room to store the neighbour entry. In this event the kernel just drops the packet entirely. The specific threshold you're hitting there is gc_thresh3.

Perf issues abound when this happens because the neighbour table is locked while the synchronous GC is performed. As such, you'll definitely want to ensure that gc_thresh3 is far higher than what you expect your ARP table to be.
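
Two hedged ways to confirm that a node is actually hitting gc_thresh3 rather than just churning entries:

# Kernel log lines like the one in this issue's title
dmesg -T | grep -i 'neighbor table overflow'
# Per-CPU neighbour cache statistics; on newer kernels the last column (table_fulls)
# counts allocations that were refused because the table was full
cat /proc/net/stat/arp_cache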

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 17, 2018
@caseydavenport
Member

Can this be closed now?

@felipejfc
Contributor Author

sure
