root cause kernel soft lockups #37853
Based on @dchen1107's assessment, downgrading this issue to non-release-blocker: she indicates that they have not been able to reliably reproduce the issue, and that it appears to be in the underlying infrastructure, not in k8s (or any user-space changes). She also indicates that "The reason of why we observed such failure on CVM a lot, but not on GCI is identified. On GCI node,
Automatic merge from submit-queue: Set kernel.softlockup_panic=1 based on the flag. ref: #37853
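For reference, a minimal sketch of what that setting does on a node (assumes root/sudo on the node; the flag-driven path in #38001 bakes it into the image instead):

```bash
# Panic instead of hanging indefinitely when the soft-lockup watchdog fires
# (pair with kernel.panic=<seconds> if the node should auto-reboot afterwards).
sudo sysctl -w kernel.softlockup_panic=1

# Persist across reboots:
echo 'kernel.softlockup_panic = 1' | sudo tee /etc/sysctl.d/99-softlockup.conf
```

With this set, a soft lockup produces a panic trace on the serial console instead of a silently wedged node.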
FWIW, we are seeing similar symptoms in 1.4.6 on GKE. It happens once a week (we have about 10 mini clusters and about 80 nodes in total): pegged CPU, no ssh, and finally we have to reset the machine. Not sure what we should look at to find the root cause of this. Halp!
Those are also symptoms of a naturally overloaded node; suggest checking the serial console output for signs of a soft lockup (http://stackoverflow.com/questions/27734763/how-do-you-access-the-console-of-a-gce-vm-instance).
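For anyone following along, a minimal sketch of that check, assuming gcloud is configured and INSTANCE/ZONE are placeholders for the affected node:

```bash
# Pull the node's serial console (works even when ssh is dead) and look for
# the watchdog's signature ("BUG: soft lockup - CPU#N stuck for Ns!").
gcloud compute instances get-serial-port-output INSTANCE --zone ZONE > serial.log
grep -i "soft lockup" serial.log
```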
@bprashanth thanks for that tip, I now have a clear pattern of what is happening. On rare occasions the network connection resets. That causes a sudden CPU spike on the node regardless of what is running on it (we have different clusters running different jobs: haproxy, a Java HTTP server, etc.).
Based on today's burndown meeting (notes), marking as release-blocking for 1.5 until we have verification that this is not an issue caused by k8s 1.5.
r-tock@ when you mentioned the node running into a possibly similar issue with 1.4.6, do you mean you never observed the issue with a previous release, even something like 1.4.2? Also, when your node ran into the issue, did you observe the same soft lockup message in the node's serial port log? cc/ @jlowdermilk too, since his script job shows no production occurrences. eparis@ yes, we do plan to set kernel.softlockup_panic=1 in the image (#38001). The problem is that with the same images (same kernel), we have seen the failure rate increase dramatically since the Thanksgiving holiday. We need to figure out the root cause of why.
FWIW, I believe the kern log messages mentioned in #37853 (comment) are unrelated and benign, though I can't say for sure from the given snippet. You will see them with pod churn; the kernel is basically logging veth/eth0 information about the container.
@dchen1107 @bprashanth CPU spikes started happening on our Java-server-dedicated clusters with 1.4 (which I believe is the switch to GCI) and every minor version after. Up until a few days ago, we were running our HAProxy-dedicated clusters on 1.3.3, which had none of these issues, but ever since we upgraded we have been having this issue on that cluster as well. I haven't seen any soft-lockup-specific errors; all I have seen are sudden CPU spikes, non-responding nodes, and those errors I pasted above, which you say are benign.
@bprashanth merge my min-sync patch (xref: #37726) and test ;-) with a modified param. /cc @eparis
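For context, a hedged sketch of what tuning that param could look like, assuming #37726 lands the knob as --iptables-min-sync-period (take the exact flag name from that PR, not from here); all other required kube-proxy flags are omitted:

```bash
# Refresh iptables at least every 30s regardless of churn, but never rewrite
# the rules more often than every 10s while Services/Endpoints are changing.
kube-proxy \
  --iptables-sync-period=30s \
  --iptables-min-sync-period=10s
```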
At this point it looks unrelated to networking; people are working on a repro and will probably have an update soon.
@bprashanth data to support? Everything in this issue so far says networking.
Also, if we have an isolated reproducer, I can quick-turn an eval to see if it occurs on our kernels.
We do not have an isolated repro yet. The best repro I have heard of is kubemark-5000. PLEASE, we need signal on this; it would help to know whether it reproduces on anyone's systems but ours.
I need more data to assist. Setting up kubemark 5k on our end is nontrivial, and the number of stack differences is also high.
We're actively trying to narrow a repro and remove variables. The matrix is large, and one dimension is "just Google". Any repro we get from outside Google would help.
@mtaufen In fact we didn't delete the node ourselves, and we did make sure to collect serial logs before the node was deleted. Someone else deleted the node, or GCE just didn't recognize the node. :(
:(
Does Jenkins have a reaper that would delete it out-of-band?
We have a janitor process, but I think it's only supposed to clean up VMs older than a certain age. @krzyzacy
@ixdy coverage is on PR projects only
But if you'd like, feel free to add more entries to https://github.com/krzyzacy/test-infra/blob/master/jobs/maintenance-pull-janitor.sh#L21
@ixdy GCP serial port is a smallish buffer. If you don't scrape it on an ongoing basis, you only get the last part.
(In reply to @ixdy's question: "How is that different from gcloud compute instances get-serial-port-output, which we already collect (https://github.com/kubernetes/kubernetes/blob/master/cluster/log-dump.sh#L103-L104)?")
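A rough sketch of ongoing scraping (INSTANCE/ZONE are placeholders; gcloud also accepts a --start byte offset if incremental reads are preferred over full dumps):

```bash
# Dump the (size-limited) serial console every minute so early lockup traces
# survive even after the buffer wraps.
while true; do
  gcloud compute instances get-serial-port-output INSTANCE --zone ZONE \
    > "serial-$(date +%s).log"
  sleep 60
done
```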
@ixdy didn't you cron this at some point? #25629 (comment)
yes, we run |
Updated the issue based on #37853 (comment)
I was able to trigger a soft lockup using this: #38731 (on ContainerVM)
I have had a GCI cluster running the above for 5 days without encountering a soft lockup.
Automatic merge from submit-queue: New e2e node test suite with memcg turned on. The flag --experimental-kernel-memcg-notification was initially added to allow disabling an eviction feature which used memcg notifications to make memory evictions more reactive. As documented in #37853, memcg notifications increased the likelihood of encountering soft lockups, especially on CVM. This feature would be valuable to turn on, at least for GCI, since soft lockup issues were less prevalent on GCI and appeared (at the time) to be unrelated to memcg notifications. In the interest of caution, I would like to monitor serial tests on GCI with --experimental-kernel-memcg-notification=true. cc @vishh @Random-Liu @dchen1107 @kubernetes/sig-node-pr-reviews
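For anyone reproducing that setup, a hedged sketch of the kubelet invocation in question (flag spelled here as --experimental-kernel-memcg-notification; verify the exact name against your kubelet version, and the eviction threshold below is only an illustrative value):

```bash
# Re-enable memcg-notification-driven memory evictions on a GCI test node.
# (All other required kubelet flags are omitted for brevity.)
kubelet \
  --experimental-kernel-memcg-notification=true \
  --eviction-hard='memory.available<100Mi'
```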
/sig network
In our production cluster, etcd is running on each master node. After we created about 2000-3000 services (which means there are more than 10000 iptables rules), the etcd cluster started to have frequent leader elections, which made our Kubernetes cluster unstable; we found that iptables-restore uses high CPU. After we stopped kube-proxy on the master nodes, the leader elections didn't happen again.
Not sure if it is because of the frequent iptables-restore calls.
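A quick sketch of how to confirm that on a master node (assumes shell access and the iptables tools on the PATH):

```bash
# Rough size of the rule set kube-proxy is managing; thousands of Services
# translate into tens of thousands of lines here.
sudo iptables-save | wc -l

# Check whether iptables-restore / kube-proxy are the processes pegging the CPU.
top -b -n 1 | grep -E 'iptables|kube-proxy'
```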
@keyingliu please open a new issue for your kube-proxy problem. This issue is specifically about the kernel soft lockup, which has been resolved. /assign
We're seeing an abnormal number of soft lockups, pegged CPUs, and unusable nodes recently. The last repros were from @bowei and @freehan.
Symptoms
Kernel logs showed ebtables-related traces, CPU was pegged at 100% and stayed there, there was no ssh, and basically no clarity until the node was reset. Looking back through older test logs, there are several failures that had NotReady nodes that MIGHT be the same bug.
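For triage, a minimal sketch of checking a suspect node for these traces (log paths vary by image; /var/log/kern.log is assumed for ContainerVM-style images):

```bash
# On a node that is still reachable (or via the serial console):
dmesg | grep -iE 'soft lockup|ebtables' | tail -n 50

# After a reset, on images that persist the kernel log to disk:
grep -iE 'soft lockup|ebtables' /var/log/kern.log | tail -n 50
```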
Actions
Minhan's working on a repro and @dchen1107 is trying to figure out when this spike happened by spelunking test logs.
We've been syncing iptables rules from kube-proxy more often than we need to for a couple of releases (#26637); this is the only thing that springs to mind that might cause CPU spikes.
We should probably try to mitigate for 1.5; marking as release-blocker until we have a better handle on it.
@saad-ali @kubernetes/sig-network @kubernetes/sig-node