Deadlock in PLEG relist() for health check and Kubelet syncLoop() #72482


Closed
denverdino opened this issue Jan 2, 2019 · 17 comments · Fixed by #72709

Labels: kind/bug, sig/node

@denverdino (Contributor) commented Jan 2, 2019

What happened:

In our stress testing, we found a deadlock between the PLEG relist() used for the runtime health check and the Kubelet syncLoop().

The goroutine stack traces are as follows:

goroutine 386 [chan send, 1140 minutes]:
k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc42069ea20)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:261 +0x74e
k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).(k8s.io/kubernetes/pkg/kubelet/pleg.relist)-fm()
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:130 +0x2a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc4212ee520)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4212ee520, 0x3b9aca00, 0x0, 0x1, 0xc420056540)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc4212ee520, 0x3b9aca00, 0xc420056540)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).Start
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:130 +0x88
...
goroutine 309 [sleep]:
time.Sleep(0x12a05f200)
	/usr/local/go/src/runtime/time.go:102 +0x166
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncLoop(0xc4205e3b00, 0xc420ff2780, 0x3e56a60, 0xc4205e3b00)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:1777 +0x1e7
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).Run(0xc4205e3b00, 0xc420ff2780)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:1396 +0x27f
k8s.io/kubernetes/cmd/kubelet/app.startKubelet.func1()
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:998 +0x67
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42105dfb0)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42105dfb0, 0x0, 0x0, 0x1, 0xc420056540)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42105dfb0, 0x0, 0xc420056540)
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by k8s.io/kubernetes/cmd/kubelet/app.startKubelet
	/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:996 +0xea
...

We did some investigation, and found the following issue.

...
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
...
	for {
		if rs := kl.runtimeState.runtimeErrors(); len(rs) != 0 {
			glog.Infof("skipping pod synchronization - %v", rs)
			// exponential backoff
			time.Sleep(duration)
			duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
			continue
		}
		// reset backoff if we have a success
		duration = base

		kl.syncLoopMonitor.Store(kl.clock.Now())
		if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
			break
		}
		kl.syncLoopMonitor.Store(kl.clock.Now())
	}
}

...
func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
	syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
	select {
         ...
	case e := <-plegCh:
		if isSyncPodWorthy(e) {
			// PLEG event for a pod; sync it.
			if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
				glog.V(2).Infof("SyncLoop (PLEG): %q, event: %#v", format.Pod(pod), e)
				handler.HandlePodSyncs([]*v1.Pod{pod})
			} else {
				// If the pod no longer exists, ignore the event.
				glog.V(4).Infof("SyncLoop (PLEG): ignore irrelevant event: %#v", e)
			}
		}

		if e.Type == pleg.ContainerDied {
			if containerID, ok := e.Data.(string); ok {
				kl.cleanUpContainersInPod(e.ID, containerID)
			}
		}
         ...
	}
	return true
}

The main event-processing loop in the Kubelet first checks whether there are any runtimeErrors: if so, it waits until the runtime is back to normal; otherwise it calls syncLoopIteration to read and process events.

The runtime health check, in turn, depends on GenericPLEG.relist(). But in the relist() implementation, pod lifecycle events are sent to eventChannel, which can introduce a deadlock when eventChannel is full.

func (g *GenericPLEG) relist() {
        ...
	// If there are events associated with a pod, we should update the
	// podCache.
	for pid, events := range eventsByPodID {
                 ... 
		for i := range events {
			// Filter out events that are not reliable and no other components use yet.
			if events[i].Type == ContainerChanged {
				continue
			}
			g.eventChannel <- events[i]
		}
	}

	... 
}

In the code above, eventChannel is not consumed while the runtime is reported unhealthy. In some situations eventChannel can fill up, and relist() then hangs while sending a new pod lifecycle event, so the runtime status can never recover.
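
For context, the health check that syncLoop consults is driven by how recently relist() completed. A minimal sketch of that dependency (paraphrased from the generic PLEG; the exact threshold and field names may differ by version):

// Sketch: PLEG is reported healthy only if relist() has made progress recently.
// When relist() blocks on a full eventChannel, relistTime is never refreshed,
// Healthy() starts failing, syncLoop() keeps skipping syncLoopIteration(), and
// the channel is never drained, which completes the deadlock.
const relistThreshold = 3 * time.Minute // approximate; version-dependent

func (g *GenericPLEG) Healthy() (bool, error) {
	relistTime := g.getRelistTime()
	elapsed := g.clock.Since(relistTime)
	if elapsed > relistThreshold {
		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
	}
	return true, nil
}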

It is related to issue #45419.

What you expected to happen:

No such issue

How to reproduce it (as minimally and precisely as possible):

It is not easy to reproduce; it happens after a few days of stress testing (scaling a deployment repeatedly under high load).

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.11.5
  • Cloud provider or hardware configuration: Bare metal
  • OS (e.g. from /etc/os-release): CentOS 7.4
  • Kernel (e.g. uname -a):
    Linux bjc-qtt-k8s-node-010 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Others:

/kind bug

@k8s-ci-robot added labels kind/bug and needs-sig on Jan 2, 2019
@BSWANG (Contributor) commented Jan 2, 2019

/sig node

@k8s-ci-robot added the sig/node label and removed the needs-sig label on Jan 2, 2019
@denverdino changed the title from "Deadlock in PLEG relist() for heathy checking and Kubelet syncLoop()" to "Deadlock in PLEG relist() for health check and Kubelet syncLoop()" on Jan 2, 2019
@denverdino (Contributor, Author) commented Jan 8, 2019 via email

@changyaowei (Contributor) commented:

When PLEG does relist(), can we invoke the kl.runtimeState.runtimeErrors func to check whether the runtime is OK, and skip relist() if it is not?

@changyaowei (Contributor) commented:

When PLEG does relist(), can we invoke the kl.runtimeState.runtimeErrors func to check whether the runtime is OK, and skip relist() if it is not?

I submitted PR #72709.

@Random-Liu (Member) commented Jan 15, 2019

Thanks for finding the issue! The issue description makes sense to me.

I think PLEG is a core component, so we should try our best to keep it unblocked. The event channel becoming full seems more like an issue on the event-consumer side, and that shouldn't block PLEG.

Option 1

One solution is to queue the events in PLEG and send them later, but that doesn't fundamentally solve the problem: if we don't cap the number of queued events, it looks like a potential memory leak; if we do set a limit, we may eventually hit this problem again.

Option 2

Another solution is to add a retry for sending the event in PLEG, and set a bit in PLEG between retries indicating that PLEG is not blocked and is still trying to send out the event. The Healthy function of PLEG would then return healthy while that bit is set.
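
A rough sketch of what option 2 could look like (purely hypothetical: sendEvent, the sending flag, and the retry interval are illustrative names, not existing kubelet code):

// Hypothetical sketch of option 2 (not the merged fix): keep retrying the send,
// and expose a flag so the health check can tell that PLEG is blocked on the
// channel rather than genuinely stuck.
func (g *GenericPLEG) sendEvent(e *PodLifecycleEvent) {
	for {
		select {
		case g.eventChannel <- e:
			g.sending.Store(false) // hypothetical atomic flag field
			return
		case <-time.After(10 * time.Second): // arbitrary retry interval
			g.sending.Store(true)
		}
	}
}

// Healthy() would then do something like:
//   if g.sending.Load() {
//       return true, nil // still making progress, just waiting on the consumer
//   }
// before comparing the time since the last relist against the threshold.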

Option 3

Actually, even if we fail to send the event and discard it, the pod cache has already been updated, so the periodic pod resync should eventually pick it up. This may delay the pod sync by up to 1 minute.

obj.SyncFrequency = metav1.Duration{Duration: 1 * time.Minute}
But if the channel is full, it means that either 1) the sync loop is blocked for some reason, or 2) the node is under heavy load. In both cases, even if the event were sent successfully, the pod sync would likely be delayed anyway.

We can:

  1. Enlarge the event channel size to reduce the chance of the channel becoming full. Currently it is arbitrarily set to 1000.
    plegChannelCapacity = 1000
  2. Discard the event in PLEG if the channel is already full.

Of the 3 options above, I'm fine with either 2 or 3. Option 2 avoids the 1-minute pod sync delay but is a bit ad hoc; option 3 is more generic but may delay pod sync.
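
For illustration, the discard-when-full approach (item 2 under this option) amounts to replacing the blocking send in relist() shown earlier with a non-blocking one. This is a sketch only; the actual change landed in #72709 and may differ in details such as metrics:

// Sketch: try a non-blocking send and drop the event if the channel is full,
// instead of letting relist() block. The pod cache has already been updated,
// so the periodic resync will eventually reconcile the pod.
select {
case g.eventChannel <- events[i]:
default:
	glog.Errorf("event channel is full, discarding pod lifecycle event %v", events[i])
}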

There might be other options.

/cc @yujuhong Any thoughts?

@Random-Liu (Member) commented Jan 15, 2019

When PLEG does relist(), can we invoke the kl.runtimeState.runtimeErrors func to check whether the runtime is OK, and skip relist() if it is not?

If PLEG is already blocked on a full event channel, this doesn't help, right?

@dchen1107 (Member) commented:

cc/ @jiayingz This looks like the one production issue you mentioned to me a while back.

@dchen1107 added this to the v1.14 milestone on Jan 15, 2019
@derekwaynecarr (Member) commented:

@Random-Liu Of the options enumerated, option 3 is my preferred one.

@yujuhong (Contributor) commented:

Thanks for reporting the issue!

  1. Enlarge the event channel size to reduce the chance of the channel becoming full. Currently it is arbitrarily set to 1000.

@denverdino, @resouer what was the load on the node that triggered this condition? If the node wasn't functioning at all and the events slowly built up over time, bumping the size of the channel won't help.

2. Discard the event in PLEG if the channel is already full.

This should work and doesn't complicate the logic in PLEG further, so it's my preferred option too.
We need some unit tests and, ideally, a stress test, as mentioned in the SIG Node meeting.

@resouer (Contributor) commented Jan 15, 2019

If the node wasn't functioning at all and the events slowly built up over time, bumping the size of the channel won't help.

Agreed. Choosing a proper size for the event channel is tricky, as the scale also grows over time. We will ship the fix as option 3 (#72709) for 1.14. Thanks for summarizing all of this at the SIG meeting!

/assign

@Random-Liu (Member) commented:

@resouer If possible, could you share how large the stress load was? It would help us understand the issue and kubelet stress handling better. :)

@resouer (Contributor) commented Jan 15, 2019

@Random-Liu @yujuhong Sure, our product team (@denverdino) will post the stress test data, after removing sensitive information, in follow-up comments.

@BSWANG (Contributor) commented Jan 16, 2019

@resouer @Random-Liu @yujuhong Thanks for the reply.
The following is our test case to reproduce this issue:
The test cluster has 5 nodes on Aliyun ECS (similar to EC2): three 4C8G machines as master nodes and two 96C384G machines as worker nodes.
First, we create 2 deployments running on the worker nodes. One deployment is HTTP server pods; the other is siege pods that generate network stress against the HTTP server deployment. These stress-test deployments produce 50 CPU cores of load on the worker node.
Then we create a job that frequently creates/deletes 20 pods every 5 seconds.
After about 1 day of stress testing, the worker node becomes NotReady. Digging into the pprof goroutine stack trace shows the channel send blocked in relist().

@Random-Liu (Member) commented:

Then we create a job that frequently creates/deletes 20 pods every 5 seconds.

@BSWANG Yeah, I can see how frequent container state changes can cause this issue. The numbers are very useful, thanks a lot!

@chestack (Contributor) commented Jun 17, 2019

@denverdino could you please share how to dump the kubelet goroutine stack traces like the ones you pasted in your comment? Thanks a lot.

@denverdino (Contributor, Author) commented:

This issue is fixed by #72709

@chestack (Contributor) commented:

Using pprof is the answer to my question:
curl http://localhost:8001/api/v1/proxy/nodes/node-1/debug/pprof/goroutine?debug=2

@denverdino thank you anyway.
