
Deadlock in PLEG relist() for health check and Kubelet syncLoop() #72482

Closed

Description

@denverdino
Contributor

What happened:

In our stress testing, we found a deadlock between PLEG relist(), which drives the runtime health check, and the Kubelet syncLoop().

The relevant goroutine stack traces are as follows:

goroutine 386 [chan send, 1140 minutes]:
k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc42069ea20)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:261 +0x74e
k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).(k8s.io/kubernetes/pkg/kubelet/pleg.relist)-fm()
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:130 +0x2a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc4212ee520)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4212ee520, 0x3b9aca00, 0x0, 0x1, 0xc420056540)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc4212ee520, 0x3b9aca00, 0xc420056540)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).Start
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:130 +0x88
...
goroutine 309 [sleep]:
time.Sleep(0x12a05f200)
	/usr/local/go/src/runtime/time.go:102 +0x166
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncLoop(0xc4205e3b00, 0xc420ff2780, 0x3e56a60, 0xc4205e3b00)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:1777 +0x1e7
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).Run(0xc4205e3b00, 0xc420ff2780)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:1396 +0x27f
k8s.io/kubernetes/cmd/kubelet/app.startKubelet.func1()
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:998 +0x67
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42105dfb0)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42105dfb0, 0x0, 0x0, 0x1, 0xc420056540)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42105dfb0, 0x0, 0xc420056540)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by k8s.io/kubernetes/cmd/kubelet/app.startKubelet
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:996 +0xea
...

We did some investigation, and found the following issue.

...
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
...
	for {
		if rs := kl.runtimeState.runtimeErrors(); len(rs) != 0 {
			glog.Infof("skipping pod synchronization - %v", rs)
			// exponential backoff
			time.Sleep(duration)
			duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
			continue
		}
		// reset backoff if we have a success
		duration = base

		kl.syncLoopMonitor.Store(kl.clock.Now())
		if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
			break
		}
		kl.syncLoopMonitor.Store(kl.clock.Now())
	}
}

...
func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
	syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
	select {
         ...
	case e := <-plegCh:
		if isSyncPodWorthy(e) {
			// PLEG event for a pod; sync it.
			if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
				glog.V(2).Infof("SyncLoop (PLEG): %q, event: %#v", format.Pod(pod), e)
				handler.HandlePodSyncs([]*v1.Pod{pod})
			} else {
				// If the pod no longer exists, ignore the event.
				glog.V(4).Infof("SyncLoop (PLEG): ignore irrelevant event: %#v", e)
			}
		}

		if e.Type == pleg.ContainerDied {
			if containerID, ok := e.Data.(string); ok {
				kl.cleanUpContainersInPod(e.ID, containerID)
			}
		}
         ...
	}
	return true
}

The Kubelet's main event-processing loop first checks whether there are any runtime errors: if so, it skips pod synchronization and sleeps with exponential backoff until the runtime reports healthy again; otherwise, it calls syncLoopIteration to read and process events.

The runtime health check, in turn, depends on GenericPLEG.relist() making progress. However, the relist() implementation sends pod lifecycle events to eventChannel, which can deadlock when eventChannel is full.

func (g *GenericPLEG) relist() {
        ...
	// If there are events associated with a pod, we should update the
	// podCache.
	for pid, events := range eventsByPodID {
                 ... 
		for i := range events {
			// Filter out events that are not reliable and no other components use yet.
			if events[i].Type == ContainerChanged {
				continue
			}
			g.eventChannel <- events[i]
		}
	}

	... 
}

In the code above, eventChannel is not consumed while the runtime is considered unhealthy. Under sustained load the channel can fill up, at which point relist() blocks on sending a new pod lifecycle event, the PLEG health check keeps failing, and the runtime status can never recover.

It is related to issue #45419.
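
To make the interaction concrete, here is a toy standalone Go sketch (illustrative only, not kubelet code) of the blocking pattern: the producer plays the role of relist() (send events, refresh the health timestamp) and the consumer plays the role of syncLoop() (skip draining while "unhealthy"); once the buffer fills, both sides stall permanently.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	events := make(chan int, 3) // stand-in for eventChannel (capacity 1000 in the kubelet)
	var lastRelist atomic.Int64 // stand-in for the relist timestamp used by the health check
	lastRelist.Store(time.Now().UnixNano())

	// Producer ("relist"): blocks on the channel send once the buffer is full,
	// and therefore stops refreshing the health timestamp.
	go func() {
		for i := 0; ; i++ {
			events <- i
			lastRelist.Store(time.Now().UnixNano())
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Consumer ("syncLoop"): refuses to drain events while the health
	// timestamp is stale, which is exactly what keeps the producer blocked.
	for {
		if time.Since(time.Unix(0, lastRelist.Load())) > 200*time.Millisecond {
			fmt.Println("runtime unhealthy, skipping pod synchronization")
			time.Sleep(100 * time.Millisecond)
			continue
		}
		select {
		case e := <-events:
			fmt.Println("handled event", e)
			time.Sleep(500 * time.Millisecond) // slow consumer lets the buffer fill up
		default:
			time.Sleep(10 * time.Millisecond)
		}
	}
}

Running this prints at most a couple of handled events and then loops on "runtime unhealthy" forever, mirroring the goroutine above that has been stuck in "chan send" for 1140 minutes.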

What you expected to happen:

No such issue

How to reproduce it (as minimally and precisely as possible):

It is not easy to reproduce; it happens after a few days of stress testing (repeatedly scaling a deployment under high load).

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.11.5
  • Cloud provider or hardware configuration: Bare metal
  • OS (e.g. from /etc/os-release): CentOS 7.4
  • Kernel (e.g. uname -a):
    Linux bjc-qtt-k8s-node-010 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Others:

/kind bug

Activity

added the kind/bug and needs-sig labels on Jan 2, 2019
BSWANG

BSWANG commented on Jan 2, 2019

@BSWANG
Contributor

/sig node

added the sig/node label and removed needs-sig on Jan 2, 2019
changed the title from "Deadlock in PLEG relist() for heathy checking and Kubelet syncLoop()" to "Deadlock in PLEG relist() for health check and Kubelet syncLoop()" on Jan 2, 2019
denverdino

denverdino commented on Jan 8, 2019

@denverdino
Contributor (Author)
changyaowei

changyaowei commented on Jan 8, 2019

@changyaowei
Contributor

When PLEG does relist(), can we invoke the kl.runtimeState.runtimeErrors func to check whether the runtime is OK, and skip relist() if it is not?

changyaowei

changyaowei commented on Jan 9, 2019

@changyaowei
Contributor

When PLEG does relist(), can we invoke the kl.runtimeState.runtimeErrors func to check whether the runtime is OK, and skip relist() if it is not?

I have submitted PR #72709.

Random-Liu

Random-Liu commented on Jan 15, 2019

@Random-Liu
Member

Thanks for finding the issue! The issue description makes sense to me.

I think PLEG is a core component, so we should try our best to keep it unblocked. The event channel becoming full seems more like an issue on the event consumer side, which shouldn't block PLEG.

Option 1

One solution is to queue the events in PLEG and send them later, but that doesn't fundamentally solve the problem: if we don't set a limit on the number of events we queue, it looks like a potential memory leak to me; if we do set a limit, we may eventually hit this problem again.

Option 2

Another solution is to add a retry for sending events in PLEG, and set a bit in PLEG between retries indicating that PLEG is not stuck, it is just trying to send out the event. The Healthy function of PLEG would then report healthy while that bit is set.
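
A hypothetical sketch of what this could look like (simplified stand-ins, not upstream code; the 3-minute threshold mirrors the existing relist threshold but is an assumption here):

package pleg

import (
	"sync/atomic"
	"time"
)

type sketchPLEG struct {
	eventChannel chan *PodLifecycleEvent
	relistTime   atomic.Value // time.Time of the last completed relist
	sending      atomic.Value // bool: true while a send is being retried
}

// sendEvent retries until the event is accepted, flagging the retry so that
// Healthy() does not count the blocked time against the relist threshold.
func (g *sketchPLEG) sendEvent(e *PodLifecycleEvent) {
	g.sending.Store(true)
	defer g.sending.Store(false)
	for {
		select {
		case g.eventChannel <- e:
			return
		case <-time.After(time.Second):
			// Channel still full; keep retrying instead of blocking forever.
		}
	}
}

// Healthy reports healthy if relist ran recently, or if relist is merely
// waiting on a full event channel rather than being stuck.
func (g *sketchPLEG) Healthy() (bool, error) {
	if s, _ := g.sending.Load().(bool); s {
		return true, nil
	}
	last, _ := g.relistTime.Load().(time.Time)
	return time.Since(last) < 3*time.Minute, nil
}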

Option 3

Actually, even if we fail to send the event and discard it, the periodic pod resync should eventually pick the change up, since we have already updated the pod cache. This may delay the pod sync by up to 1 minute:

obj.SyncFrequency = metav1.Duration{Duration: 1 * time.Minute}
But if the channel is full, it means that either 1) the sync loop is blocked for some reason, or 2) the node is under heavy load. In both cases, even if the event is successfully sent, the pod sync will likely be delayed anyway.

We can:

  1. Enlarge the event channel size to reduce the chance of the channel becoming full. It is currently arbitrarily set to 1000:
    plegChannelCapacity = 1000
  2. Discard the event in PLEG if the channel is already full.

Of the 3 options above, I'm fine with either 2 or 3. Option 2 doesn't have the 1-minute pod sync delay but is a bit ad hoc; option 3 is more generic but may delay pod sync.
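
For reference, a minimal sketch of the "discard when full" behavior inside the relist() loop (illustrative only, not necessarily the exact upstream patch):

// Inside relist(): replace the blocking send with a non-blocking one.
select {
case g.eventChannel <- events[i]:
default:
	// Channel is full: drop the event and rely on the periodic pod resync
	// (SyncFrequency, 1 minute by default) to reconcile the pod later, since
	// the pod cache has already been updated in relist().
	glog.Errorf("PLEG event channel is full, discarding event: %#v", events[i])
}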

There might be other options.

/cc @yujuhong Any thoughts?

Random-Liu

Random-Liu commented on Jan 15, 2019

@Random-Liu
Member

When PLEG does relist(), can we invoke the kl.runtimeState.runtimeErrors func to check whether the runtime is OK, and skip relist() if it is not?

If PLEG is already blocked by a full event channel, this doesn't help, right?

dchen1107

dchen1107 commented on Jan 15, 2019

@dchen1107
Member

cc/ @jiayingz This looks like the one production issue you mentioned to me a while back.

added this to the v1.14 milestone on Jan 15, 2019
derekwaynecarr

derekwaynecarr commented on Jan 15, 2019

@derekwaynecarr
Member

@Random-Liu Of the options enumerated, option 3 is my preference.

yujuhong

yujuhong commented on Jan 15, 2019

@yujuhong
Contributor

Thanks for reporting the issue!

  1. Enlarge the event channel size to reduce the chance of the channel becoming full. It is currently arbitrarily set to 1000.

@denverdino, @resouer what was the load on the node that triggered this condition? If the node wasn't functioning at all, and the events slowly built up over time, bumping the size of the channel won't help.

2. Discard the event in PLEG if the channel is already full.

This should work and doesn't complicate the logic in PLEG further, so it's my preferred option too.
We need some unit tests and, ideally, a stress test, as mentioned in the SIG Node meeting.

resouer

resouer commented on Jan 15, 2019

@resouer
Contributor

If the node wasn't functioning at all, and the events slowly built up over time, bumping the size of the channel won't help.

Agreed, picking a proper size for the event channel is tricky, as the scale also grows over time. We will ship the fix as option 3 (#72709) for 1.14. Thanks for summing all of this up at the SIG meeting!

/assign

Random-Liu

Random-Liu commented on Jan 15, 2019

@Random-Liu
Member

@resouer If possible, could you share how large the stress test was? It would help us understand the issue and kubelet stress handling better. :)

resouer

resouer commented on Jan 15, 2019

@resouer
Contributor

@Random-Liu @yujuhong Sure, our product team (@denverdino) will post the stress test data in follow-up comments after removing sensitive information.

BSWANG

BSWANG commented on Jan 16, 2019

@BSWANG
Contributor

@resouer @Random-Liu @yujuhong Thanks for the reply.
The following is our test case to reproduce this issue:
The test cluster has 5 nodes on Aliyun ECS (similar to EC2): three 4C8G machines as master nodes and two 96C384G machines as worker nodes.
First, we create 2 deployments on the worker nodes. One deployment runs HTTP server pods; the other runs siege pods that generate network stress against the HTTP server deployment. Together these stress deployments produce about 50 CPU cores of load on the worker nodes.
Then we create a job that creates/deletes 20 pods every 5 seconds.
After about 1 day of stress testing, the worker node becomes NotReady. Digging into the pprof goroutine stack trace shows the channel send blocked in relist().

Random-Liu

Random-Liu commented on Jan 16, 2019

@Random-Liu
Member

Then we create a job that creates/deletes 20 pods every 5 seconds.

@BSWANG Yeah, I can see how frequent container state changes can cause this issue. The numbers are very useful, thanks a lot!

chestack

chestack commented on Jun 17, 2019

@chestack
Contributor

@denverdino could you please share how you dumped the kubelet goroutine stack traces that you pasted in your comment? Thanks a lot.

denverdino

denverdino commented on Jun 17, 2019

@denverdino
Contributor (Author)

This issue is fixed by #72709

chestack

chestack commented on Jun 18, 2019

@chestack
Contributor

Using pprof is the answer to my question:
curl http://localhost:8001/api/v1/proxy/nodes/node-1/debug/pprof/goroutine?debug=2

@denverdino thank you anyway.
