CLOSE_WAIT connections on master node when ELBs point there #43212
Thank you for the excellent report! So my theory is this:
For 3: I confirmed that during the time a service had no pods, the
And kube-proxy was listening on 31445 (the NodePort):
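A check along these lines can confirm that (an illustrative command, not the original output; the NodePort 31445 is the one from this report):

```
# Confirm which process is listening on the NodePort (31445 in this report).
sudo netstat -tlnp | grep ':31445'
```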
Also, the number of CLOSE_WAIT connections went up during the time period when I restarted the pod in my service. I confirmed that 172.20.104.161 and 172.20.108.40 are the IP addresses of my ELBs. They are doing TCP health checks every 5s (IIRC). It is also possible that the health check is an unusual TCP pattern, because it is not an HTTP health check; it merely opens and closes the connection.

For this particular issue, which was about the master, my suspicion is that the same will happen on the nodes, since the kube-proxy configuration should be the same. If we actually know that this does not happen on the nodes, that is interesting information.

Two possible fixes spring to mind:

A) Add a rule that rejects the connection when there are no Pods behind the NodePort. Efficient, but iptables is never easy, and I don't know if this will cause health checks to fail, which isn't wrong but would slow down recovery.

cc @felipejfc as this looks similar to what you are reporting in #41640
cc @thockin for kube-proxy guru-ness and advice on which option to pursue
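For illustration only, option A amounts to an iptables rule of roughly this shape. This is a sketch, assuming the NodePort 31445 from above and a hypothetical namespace/service name; the chain and comment that kube-proxy actually generates may differ:

```
# Sketch of option A (not the exact rule kube-proxy installs): reject new
# connections to a NodePort whose Service has no endpoints, instead of
# accepting them and letting them pile up in CLOSE_WAIT.
iptables -A INPUT -p tcp -m tcp --dport 31445 \
  -m addrtype --dst-type LOCAL \
  -m comment --comment "hypothetical-ns/hypothetical-svc has no endpoints" \
  -j REJECT --reject-with icmp-port-unreachable
```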
I actually had 2 namespaces with 3 services each (ELB type) that had no pods associated, because someone had forgotten to delete them after deleting the pods, and they had health checks configured. After deleting the services today we've seen a massive networking performance boost (thousands of sockets in CLOSE_WAIT state got closed). I'll look for other services with no pods associated, delete them, and keep an eye on the cluster. Thanks for helping @justinsb!!
Automatic merge from submit-queue

Install a REJECT rule for nodeport with no backend

Rather than actually accepting the connection, REJECT. This will avoid CLOSE_WAIT.

Fixes #43212

@justinsb @felipejfc @Spiddy
@justinsb do you recall if we ported this back to 1.6?
@thockin looks like we got the first one into 1.6, but not the second one :-(

These are the two commits (for some reason GitHub only shows the branches in this view, not the PR view...)

I did reopen the cherry-pick of the first one to 1.5 this morning (I've been getting pings on this issue): #43858

Looks like we should get #43858 in, and then cherry-pick 9a423b6 to 1.5 and 1.6.

I do recommend that people hitting this in the real world remove services without endpoints - it is almost always just an error/oversight. I don't think it's a huge problem to leak a few connections on a restart if you happen to end up with no pods for a ~minute. Also, removing such services typically saves the cost of an extra ELB.
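As a quick way to find such services, a command like the following (illustrative, not from the thread) lists Endpoints objects with no addresses:

```
# List Services whose Endpoints have no addresses, i.e. candidates for the
# "service with no endpoints" cleanup suggested above.
kubectl get endpoints --all-namespaces | grep '<none>'
```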
In what version of Kubernetes is this issue expected to be resolved? The problem still manifests on my Kubernetes 1.6.3 deployment.
Still seeing this on Kubernetes 1.6.4 on AWS. The LoadBalancer-type Services in our cluster all have Pods associated, but we still have thousands of sockets in CLOSE_WAIT.
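For triage, a command along these lines (illustrative) counts CLOSE_WAIT sockets grouped by peer address, which shows whether they all originate from the ELB health-check source IPs (e.g. 172.20.104.161 / 172.20.108.40 mentioned above):

```
# Count CLOSE_WAIT sockets per remote address on the node.
netstat -tn | awk '$6 == "CLOSE_WAIT" {split($5, a, ":"); print a[1]}' \
  | sort | uniq -c | sort -rn
```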
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No.
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
CLOSE_WAIT
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT
Kubernetes version (use kubectl version):

Environment:

Kernel (uname -a): Linux ip-172-31-64-14 4.4.41-k8s #1 SMP Mon Jan 9 15:34:39 UTC 2017 x86_64 GNU/Linux

What happened:
The problem was triggered when a Service of type=LoadBalancer was left without ready Pods. This triggers a reproducible wave of CLOSE_WAIT connections on the master node(s).
What you expected to happen:
There should not be any flooding of CLOSE_WAIT connections.
How to reproduce it (as minimally and precisely as possible):
This should trigger the CLOSE_WAIT on the master.
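As a sketch (all names and ports here are hypothetical placeholders, not taken from the report), a Service like the following reproduces the condition: type=LoadBalancer with a selector that matches no Pods, so the ELB health checks hit a NodePort with no backend:

```
# Hypothetical reproduction: a LoadBalancer Service whose selector matches
# no Pods. The ELB created for it will health-check the NodePort on every
# node, including the master when it is registered with the ELB.
cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: Service
metadata:
  name: close-wait-repro        # placeholder name
spec:
  type: LoadBalancer
  selector:
    app: does-not-exist         # intentionally matches no Pods
  ports:
  - port: 80
    targetPort: 8080
EOF
```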
Take note that because kops v1.5.3 uses taints instead of SchedulingDisabled (kubernetes/kops#639), the master nodes are also added under the ELB on AWS.
Anything else we need to know:
If Pods are added to the LoadBalancer Service, the CLOSE_WAITs stop rising once the Pods are ready.
The CLOSE_WAITs start rising on the master only once the ELB marks the master node as "InService", not before.
Once too many CLOSE_WAITs are generated, the following error appears, the master is marked as not_ready, and SSH becomes unresponsive. Logs were gathered from "AWS > Instance Settings > Get System Log".

Reported together with @mikim83