Description
We're seeing an abnormal number of soft lockups, pegged CPUs, and unusable nodes recently. The last repros were from @bowei and @freehan.
Symptoms
Kernel logs showed ebtables-related traces, the CPU was pegged at 100% and stayed there, SSH was unavailable, and there was basically no visibility until the node was reset. Looking back through older test logs, there are several failures with NotReady nodes that MIGHT be the same bug.
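For anyone triaging a suspect node, a quick generic check (not commands taken from this thread) is to grep the kernel log once the node is reachable again or via the serial console; soft lockups show up as "BUG: soft lockup - CPU#N stuck for NNs!" lines, and in this case the accompanying stack traces mentioned ebtables:

```
# Sketch only: look for soft-lockup messages and ebtables frames in the kernel log.
dmesg -T | grep -i 'soft lockup'
journalctl -k | grep -iE 'soft lockup|ebtables'
```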
Actions
Minhan's working on a repro and @dchen1107 is trying to figure out when this spike happened by spelunking test logs.
We've been syncing iptables rules from kube-proxy more often than we need to for a couple of releases (#26637); this is the only thing that springs to mind that might cause CPU spikes (a mitigation sketch follows below).
We should probably try to mitigate for 1.5; marking as release-blocker until we have a better handle on it.
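If we do end up rate-limiting the iptables sync as a mitigation, a minimal sketch of what that could look like on a node, assuming the standard kube-proxy flags `--iptables-sync-period` and `--iptables-min-sync-period` (the values below are illustrative, not a recommendation from this thread):

```
# Sketch only: bound how often kube-proxy rewrites iptables rules.
# --iptables-sync-period: upper bound on the interval between full resyncs.
# --iptables-min-sync-period: lower bound, so endpoint/service churn can't
#   trigger resyncs more often than this.
kube-proxy --proxy-mode=iptables \
  --iptables-sync-period=30s \
  --iptables-min-sync-period=10s
```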
@saad-ali @kubernetes/sig-network @kubernetes/sig-node
Activity
saad-ali commented on Dec 2, 2016
Based on @dchen1107's assessment, downgrading this issue to non-release-blocker:
She indicates that they have not been able to reliably repro the issue, and that it appears to be in underlying infrastructure, not k8s (or any user space changes).
She also indicates: "The reason why we observed such failures so often on CVM, but not on GCI, has been identified. On GCI nodes, kernel.softlockup_panic = 1; on CVM it is disabled. While the root cause is still under investigation, we should configure CVM nodes the same as GCI nodes. I will send a PR shortly."
Merge pull request #38001 from dchen1107/master
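For anyone wanting to apply the same setting manually (for example on a CVM node before #38001 lands), a rough sketch; the sysctl.d file name below is illustrative:

```
# Sketch: check and enable the GCI-style behavior so a soft lockup panics the
# node (which, with a reboot-on-panic policy, recovers it) instead of leaving
# it wedged with a pegged CPU.
sysctl kernel.softlockup_panic            # 0 means disabled (CVM default)
sudo sysctl -w kernel.softlockup_panic=1  # enable until next reboot
# Persist across reboots (file name is illustrative):
echo 'kernel.softlockup_panic = 1' | sudo tee /etc/sysctl.d/99-softlockup.conf
```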
robin-anil commented on Dec 3, 2016
FWIW, we are seeing similar symptoms on 1.4.6 on GKE. It happens about once a week (we have about 10 mini clusters and about 80 nodes in total): pegged CPU, no SSH, and eventually we have to reset the machine. Not sure what we should look at to find the root cause of this. Halp!