
root cause kernel soft lockups #37853

Closed
bprashanth opened this issue Dec 1, 2016 · 113 comments
Comments

@bprashanth
Contributor

bprashanth commented Dec 1, 2016

We're seeing an abnormal number of soft lockups, pegged CPUs, and unusable nodes recently. The last repros were from @bowei and @freehan.

Symptoms
Kernel logs showed ebtables-related traces, the CPU was pegged at 100% and stayed there, there was no ssh access, and basically no visibility until the node was reset. Looking back through older test logs, there are several failures with NotReady nodes that MIGHT be the same bug.
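
A quick way to check a node for these traces (assuming you still have shell access; otherwise grep the captured serial log instead):

# look for lockup and ebtables traces in the kernel ring buffer
dmesg -T | grep -iE "soft lockup|ebtables"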

Actions
Minhan's working on a repro and @dchen1107 is trying to figure out when this spike happened by spelunking test logs.

We've been syncing iptables rules from kube-proxy more often than we need to for a couple of releases (#26637); this is the only thing that springs to mind that might cause CPU spikes.

We should probably try to mitigate for 1.5; marking this as a release-blocker until we have a better handle on it.
@saad-ali @kubernetes/sig-network @kubernetes/sig-node

@bprashanth bprashanth added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-blocker labels Dec 1, 2016
@bprashanth bprashanth added this to the v1.5 milestone Dec 1, 2016
@saad-ali
Member

saad-ali commented Dec 2, 2016

Based on @dchen1107's assessment, downgrading this issue to non-release-blocker:

She indicates that they have not been able to reliably repro the issue, and that it appears to be in the underlying infrastructure, not in k8s (or any user-space changes).

She also indicates that "The reason why we observed this failure a lot on CVM, but not on GCI, has been identified. On GCI nodes, kernel.softlockup_panic = 1; on CVM it is disabled. While the root cause is still under investigation, we should configure CVM nodes the same as GCI nodes. I will send a PR shortly."
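
For reference, that setting can be applied manually on a node along these lines (a rough sketch; the real change is going in via the PR mentioned above):

# panic on a soft lockup instead of hanging with a pegged CPU
sysctl -w kernel.softlockup_panic=1
# persist across reboots
echo 'kernel.softlockup_panic = 1' > /etc/sysctl.d/99-softlockup.conf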

k8s-github-robot pushed a commit that referenced this issue Dec 3, 2016
Automatic merge from submit-queue

Set kernel.softlockup_panic =1 based on the flag.

ref: #37853
@robin-anil

FWIW, we are seeing similar symptoms on 1.4.6 on GKE. It happens about once a week (we have about 10 mini clusters and about 80 nodes in total): pegged CPU, no ssh, and finally we have to reset the machine. Not sure what we should look at to find the root cause of this. Halp!

@bprashanth
Contributor Author

Those are also symptoms of a naturally overloaded node; I suggest checking the serial console output for signs of a soft lockup (http://stackoverflow.com/questions/27734763/how-do-you-access-the-console-of-a-gce-vm-instance).
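
On GCE that's roughly (a sketch; the instance name and zone below are placeholders):

gcloud compute instances get-serial-port-output NODE_NAME --zone ZONE | grep -i "soft lockup"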

@robin-anil

@bprashanth thanks for that tip, I now have a clear pattern of what is happening. On rare occasions the network connection resets. That causes a sudden CPU spike on the node regardless of what is running on it (we have different clusters running different jobs: haproxy, a Java HTTP server, etc.).

[188168.518885] cbr0: port 2(vethad9733de) entered disabled state
[188168.528185] device vethad9733de left promiscuous mode
[188168.533470] cbr0: port 2(vethad9733de) entered disabled state
[188181.004092] cbr0: port 3(veth6d7dcbed) entered disabled state
[188181.012766] device veth6d7dcbed left promiscuous mode
[188181.018022] cbr0: port 3(veth6d7dcbed) entered disabled state
[188181.088671] cbr0: port 1(veth7fcf1893) entered disabled state
[188181.097974] device veth7fcf1893 left promiscuous mode
[188181.103232] cbr0: port 1(veth7fcf1893) entered disabled state
[188219.102171] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[188219.116995] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[188219.123665] device veth8e5451f6 entered promiscuous mode
[188219.129248] cbr0: port 1(veth8e5451f6) entered forwarding state
[188219.135413] cbr0: port 1(veth8e5451f6) entered forwarding state
[188219.317414] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[188219.328398] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[188219.335113] device veth4a3eaad5 entered promiscuous mode
[188219.340697] cbr0: port 2(veth4a3eaad5) entered forwarding state
[188219.346831] cbr0: port 2(veth4a3eaad5) entered forwarding state
[188219.385428] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[188219.397027] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[188219.403617] device vethf2b0c924 entered promiscuous mode
[188219.409168] cbr0: port 3(vethf2b0c924) entered forwarding state
[188219.415300] cbr0: port 3(vethf2b0c924) entered forwarding state
[188234.178997] cbr0: port 1(veth8e5451f6) entered forwarding state
[188234.370986] cbr0: port 2(veth4a3eaad5) entered forwarding state
[188234.434980] cbr0: port 3(vethf2b0c924) entered forwarding state

@saad-ali
Member

saad-ali commented Dec 5, 2016

Based on today's burndown meeting (notes), marking this as release-blocking for 1.5 until we have verification that it is not an issue caused by k8s 1.5.

@dchen1107
Member

dchen1107 commented Dec 5, 2016

r-tock@ when you mentioned the node running into a possibly similar issue with 1.4.6, do you mean you never observed the issue with a previous release? Even something like 1.4.2? Also, when your node ran into the issue, did you observe the same soft lockup message in the node's serial port log? cc/ @jlowdermilk too, since his script job shows no production occurrences.

eparis@ yes, we do plan to set kernel.softlockup_panic = 1 with the image (#38001). The problem is that with the same images (same kernel), we have seen the failure rate increase dramatically since the Thanksgiving holiday. We need to figure out the root cause of why.

@bprashanth
Contributor Author

FWIW I believe the kernel log messages mentioned in #37853 (comment) are unrelated and benign, though I can't say for sure from the given snippet. You will see them with pod churn; the kernel is basically logging veth/eth0 information about the container.

@robin-anil

@dchen1107 @bprashanth CPU spikes started happening on our Java-server-dedicated clusters with 1.4 (which I believe is when the switch to GCI happened) and every minor version after. Up until a few days ago, we were running our HAProxy-dedicated clusters on 1.3.3, which had none of these issues. But ever since we upgraded, we have been having this issue on that cluster as well.

I haven't seen any soft-lockup-specific errors; all I have seen are sudden CPU spikes, non-responding nodes, and the messages I pasted above, which you say are benign.

@timothysc
Member

timothysc commented Dec 6, 2016

@bprashanth merge my min-sync patch (xref: #37726 ) and test ;-) with a modified param.

/cc @eparis
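
For context, the point is to rate-limit how often kube-proxy rewrites iptables. Assuming the flag from that patch lands as --iptables-min-sync-period (my guess at the final name), the test would look something like:

# hypothetical invocation; flag names/values are for illustration only
kube-proxy --proxy-mode=iptables --iptables-sync-period=30s --iptables-min-sync-period=10s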

@bprashanth
Contributor Author

At this point it looks unrelated to networking; people are working on a repro and will probably have an update soon.

@timothysc
Member

@bprashanth data to support that? Everything in this issue so far says networking.

@timothysc
Member

Also, if we have an isolated reproducer, I can quickly turn around an eval to see if it occurs on our kernels.

@thockin
Member

thockin commented Dec 6, 2016 via email

@timothysc
Member

I need more data to assist. Setting up kubemark 5k on our end is nontrivial and the number of stack differences is also high.

@thockin
Member

thockin commented Dec 6, 2016 via email

@Random-Liu
Member

Random-Liu commented Dec 10, 2016

@mtaufen In fact we didn't delete the node ourselves; we did make sure to collect serial logs before deleting the node.

Someone else deleted the node, or GCE just didn't recognize the node. :(

@mtaufen
Contributor

mtaufen commented Dec 10, 2016 via email

@ixdy
Member

ixdy commented Dec 10, 2016

we have a janitor process, but I think it's only supposed to clean up VMs older than a certain age. @krzyzacy

@krzyzacy
Member

@ixdy coverage is on PR projects only

@krzyzacy
Member

But if you'd like, feel free to add more entries to https://github.com/krzyzacy/test-infra/blob/master/jobs/maintenance-pull-janitor.sh#L21

@thockin
Member

thockin commented Dec 10, 2016 via email

@bprashanth
Contributor Author

@ixdy didn't you cron this at some point? #25629 (comment)

@ixdy
Member

ixdy commented Dec 10, 2016

Yes, we run kubernetes/test-infra/jenkins/clean_project.py --project=k8s-jkns-ci-node-e2e --hours=3 --delete every 3 hours. It shouldn't be deleting anything newer than 3 hours old, though.

@dchen1107
Member

Updated the issue based on #37853 (comment)

  1. The 1.5 release branch looked fine over the weekend. I am removing the release-blocker label from the issue for now.
  2. @dashpole has been trying to reproduce the original issue with --experimental-kernel-memcg-notification=true, but has not succeeded so far. We are continuing to work on this.

@dashpole
Contributor

dashpole commented Dec 13, 2016

I was able to trigger soft lockup using this: #38731 (on containervm)

@dashpole
Contributor

I have had a GCI cluster running the above for 5 days without encountering a soft lockup.
Additionally, using the code from threshold_notifier.go from #32577, I wrote a script that crosses the memory threshold many times a second.
This triggered a soft lockup fairly quickly on CVM (<2 hours).
However, after a weekend of running on GCI, I haven't encountered a soft lockup using this method either.
I added the Go code for the second repro case to #38731.
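
For anyone following along, here is a minimal sketch of the cgroup-v1 memcg threshold mechanism that repro exercises (the real code is threshold_notifier.go; the cgroup path, the 1 GiB threshold, and the stripped-down error handling here are illustrative assumptions):

package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/unix"
)

const (
	memCgroup = "/sys/fs/cgroup/memory" // assumes a cgroup-v1 memory hierarchy mounted here
	threshold = 1 << 30                 // 1 GiB, arbitrary threshold for illustration
)

func main() {
	// eventfd that the kernel signals each time memory usage crosses the threshold.
	efd, err := unix.Eventfd(0, unix.EFD_CLOEXEC)
	if err != nil {
		log.Fatalf("eventfd: %v", err)
	}

	usage, err := os.Open(memCgroup + "/memory.usage_in_bytes")
	if err != nil {
		log.Fatalf("open usage file: %v", err)
	}
	defer usage.Close()

	// Register the watch by writing "<eventfd> <usage fd> <threshold in bytes>".
	control, err := os.OpenFile(memCgroup+"/cgroup.event_control", os.O_WRONLY, 0)
	if err != nil {
		log.Fatalf("open event_control: %v", err)
	}
	if _, err := control.WriteString(fmt.Sprintf("%d %d %d", efd, usage.Fd(), threshold)); err != nil {
		log.Fatalf("register threshold: %v", err)
	}
	control.Close()

	// Each read blocks until the threshold is crossed and returns an 8-byte counter.
	buf := make([]byte, 8)
	for {
		if _, err := unix.Read(efd, buf); err != nil {
			log.Fatalf("read eventfd: %v", err)
		}
		log.Println("memory usage crossed the threshold")
	}
}

The repro essentially registers and crosses thresholds like this at a high rate; doing that on CVM was enough to trigger the lockup.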

@timothysc
Member

k8s-github-robot pushed a commit that referenced this issue Mar 8, 2017
Automatic merge from submit-queue

New e2e node test suite with memcg turned on

The flag --experimental-kernel-memcg-notification was initially added to allow disabling an eviction feature which used memcg notifications to make memory evictions more reactive.
As documented in #37853, memcg notifications increased the likelihood of encountering soft lockups, especially on CVM.

This feature would be valuable to turn on, at least for GCI, since soft lockup issues were less prevalent on GCI and appeared (at the time) to be unrelated to memcg notifications.

In the interest of caution, I would like to monitor serial tests on GCI with --experimental-kernel-memcg-notification=true.

cc @vishh @Random-Liu @dchen1107 @kubernetes/sig-node-pr-reviews
@gmarek gmarek removed their assignment Apr 3, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 31, 2017
@caseydavenport
Member

/sig network

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Jun 12, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 12, 2017
@keyingliu

keyingliu commented Jul 3, 2017

In our production cluster, etcd runs on each master node. After we created about 2000-3000 services (which means there are more than 10000 iptables rules), the etcd cluster started having frequent leader elections, which made our Kubernetes cluster unstable; we found iptables-restore using a lot of CPU. After we stopped kube-proxy on the master nodes, the leader elections didn't happen again:

  • with kube-proxy running on the master nodes, etcd leader elections happened frequently
  • with kube-proxy stopped and the iptables rules cleared, no leader elections
  • with kube-proxy stopped but the iptables rules kept, no leader elections

Not sure if it is because of the frequent iptables-restore calls.
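
For what it's worth, a quick way to gauge how big the ruleset is and how expensive each kube-proxy sync might be on a node (standard iptables tooling, just a rough diagnostic):

# rough count of installed rules (kube-proxy programs several per service/endpoint)
iptables-save | wc -l
# get a feel for how long a full dump of the ruleset takes
time iptables-save > /dev/null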

@bowei
Member

bowei commented Jul 5, 2017

@keyingliu please open a new issue for your kube-proxy problem. This issue is specifically about the kernel soft lockup issue, which has been resolved.

/assign
