root cause kernel soft lockups  #37853

Closed

Description

@bprashanth
Contributor

We're seeing an abnormal number of soft lockups, pegged CPUs, and unusable nodes recently. The most recent repros came from @bowei and @freehan.

Symptoms
Kernel logs showed ebtables-related traces, CPU was pegged at 100% and stayed there, and SSH was unavailable, so we had basically no visibility until the node was reset. Looking back through older test logs, several failures had NotReady nodes that MIGHT be the same bug.
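For anyone going through saved kernel logs, here is a minimal sketch of pulling out soft lockup reports. It assumes the standard watchdog message format, `BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]`. The names `SoftLockup` and `find_soft_lockups` are hypothetical helpers for illustration, not part of any existing tool, and the line in the comment is a made-up example of the format, not output from the affected nodes.

```python
import re
from typing import List, NamedTuple


class SoftLockup(NamedTuple):
    cpu: int      # CPU number the watchdog reported as stuck
    seconds: int  # how long the CPU had been stuck when reported
    task: str     # name (comm) of the task running on that CPU
    pid: int      # PID of that task


# Illustrative example of the format this matches (not a real log line):
#   BUG: soft lockup - CPU#3 stuck for 22s! [ebtables:4321]
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]*):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text: str) -> List[SoftLockup]:
    """Return one SoftLockup per matching line in the given kernel log text."""
    hits = []
    for line in log_text.splitlines():
        m = SOFT_LOCKUP_RE.search(line)
        if m:
            hits.append(
                SoftLockup(int(m["cpu"]), int(m["secs"]), m["task"], int(m["pid"]))
            )
    return hits
```

Grouping the results by `task` across nodes would show whether the lockups really cluster around ebtables or are spread over unrelated tasks.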

Actions
Minhan's working on a repro and @dchen1107 is trying to figure out when this spike happened by spelunking test logs.

We've been syncing iptables rules from kube-proxy more often than we need to for a couple of releases (#26637); this is the only thing that springs to mind that might cause CPU spikes.

We should probably try to mitigate this for 1.5. Marking as release-blocker until we have a better handle on it.
@saad-ali @kubernetes/sig-network @kubernetes/sig-node

Activity

added this to the v1.5 milestone on Dec 1, 2016
saad-ali commented on Dec 2, 2016

@saad-ali
Member

Based on @dchen1107's assessment, downgrading this issue to non-release-blocker:

She indicates that they have not been able to reliably repro the issue, and that it appears to be in the underlying infrastructure, not in k8s (or any user-space changes).

She also indicates that "The reason of why we observed such failure on CVM a lot, but not on GCI is identified. On GCI node, kernel.softlockup_panic = 1; but on CVM, it is disabled. While the root cause is still under the investigation, we should configure CVM node the same as GCI node. I will send pr shortly."
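To compare nodes on this setting, a minimal sketch of a check, assuming a Linux node that exposes the `kernel.softlockup_panic` sysctl at `/proc/sys/kernel/softlockup_panic`. The function name is hypothetical, and this is not necessarily how the upcoming PR configures CVM nodes.

```python
from pathlib import Path

# procfs path backing the kernel.softlockup_panic sysctl
SOFTLOCKUP_PANIC = Path("/proc/sys/kernel/softlockup_panic")


def softlockup_panic_enabled(path: Path = SOFTLOCKUP_PANIC) -> bool:
    """Return True if the kernel is configured to panic on a soft lockup.

    The file holds 0 (keep running after a soft lockup) or 1 (panic).
    The path is a parameter so the check can be pointed at a copied file.
    """
    return int(path.read_text().strip()) != 0
```

A node where this returns False stays up, pegged and unreachable, after a soft lockup; a node where it returns True panics instead, which would explain why the two images show the failure so differently.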

added a commit that references this issue on Dec 3, 2016

Merge pull request #38001 from dchen1107/master

robin-anil commented on Dec 3, 2016

@robin-anil

FWIW, we are seeing similar symptoms in 1.4.6 on GKE. It happens once a week (we have about 10 mini clusters and about 80 nodes in total): pegged CPU, no SSH, and eventually we have to reset the machine. Not sure what we should look at to find the root cause of this. Halp!


Metadata

Labels

priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
sig/network: Categorizes an issue or PR as relevant to SIG Network.

Participants

@dims @timothysc @sjenning @bowei @mtaufen
