Description
What happened:
In our stress testing, we found a deadlock between PLEG relist(), which is used for health checking, and the Kubelet syncLoop().
The goroutine stack traces are as follows:
goroutine 386 [chan send, 1140 minutes]:
k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).relist(0xc42069ea20)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:261 +0x74e
k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).(k8s.io/kubernetes/pkg/kubelet/pleg.relist)-fm()
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:130 +0x2a
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc4212ee520)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc4212ee520, 0x3b9aca00, 0x0, 0x1, 0xc420056540)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc4212ee520, 0x3b9aca00, 0xc420056540)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by k8s.io/kubernetes/pkg/kubelet/pleg.(*GenericPLEG).Start
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/pleg/generic.go:130 +0x88
...
goroutine 309 [sleep]:
time.Sleep(0x12a05f200)
/usr/local/go/src/runtime/time.go:102 +0x166
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncLoop(0xc4205e3b00, 0xc420ff2780, 0x3e56a60, 0xc4205e3b00)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:1777 +0x1e7
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).Run(0xc4205e3b00, 0xc420ff2780)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:1396 +0x27f
k8s.io/kubernetes/cmd/kubelet/app.startKubelet.func1()
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:998 +0x67
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42105dfb0)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42105dfb0, 0x0, 0x0, 0x1, 0xc420056540)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42105dfb0, 0x0, 0xc420056540)
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by k8s.io/kubernetes/cmd/kubelet/app.startKubelet
/workspace/anago-v1.11.5-beta.0.24+753b2dbc622f5c/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:996 +0xea
...
We did some investigation, and found the following issue.
...
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
...
for {
if rs := kl.runtimeState.runtimeErrors(); len(rs) != 0 {
glog.Infof("skipping pod synchronization - %v", rs)
// exponential backoff
time.Sleep(duration)
duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
continue
}
// reset backoff if we have a success
duration = base
kl.syncLoopMonitor.Store(kl.clock.Now())
if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
break
}
kl.syncLoopMonitor.Store(kl.clock.Now())
}
}
...
func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
select {
...
case e := <-plegCh:
if isSyncPodWorthy(e) {
// PLEG event for a pod; sync it.
if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
glog.V(2).Infof("SyncLoop (PLEG): %q, event: %#v", format.Pod(pod), e)
handler.HandlePodSyncs([]*v1.Pod{pod})
} else {
// If the pod no longer exists, ignore the event.
glog.V(4).Infof("SyncLoop (PLEG): ignore irrelevant event: %#v", e)
}
}
if e.Type == pleg.ContainerDied {
if containerID, ok := e.Data.(string); ok {
kl.cleanUpContainersInPod(e.ID, containerID)
}
}
...
}
return true
}
The Kubelet's main event-processing loop checks whether there are any runtime errors: if there are, it waits until the runtime is healthy again; if not, it calls syncLoopIteration to read and process events.
The runtime health check in turn depends on GenericPLEG.relist(). However, relist() sends pod lifecycle events to eventChannel, which can deadlock when eventChannel is full.
func (g *GenericPLEG) relist() {
...
// If there are events associated with a pod, we should update the
// podCache.
for pid, events := range eventsByPodID {
...
for i := range events {
// Filter out events that are not reliable and no other components use yet.
if events[i].Type == ContainerChanged {
continue
}
g.eventChannel <- events[i]
}
}
...
}
In the code above, eventChannel is not consumed while the runtime is unhealthy. In that situation eventChannel can fill up, relist() hangs while sending a new pod lifecycle event, and the runtime status can never recover.
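To make the failure cycle concrete, here is a minimal, self-contained Go sketch of the pattern (not Kubelet code; the channel size, timings, and names are made up for illustration): the producer blocks on a full channel, the producer's liveness timestamp goes stale, and the consumer then refuses to drain the channel because the producer looks unhealthy.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	eventCh := make(chan int, 2) // tiny buffer so it fills up quickly
	var lastRelist atomic.Value
	lastRelist.Store(time.Now())

	// healthy mirrors the PLEG health check: it only passes if the
	// producer loop has made progress recently.
	healthy := func() bool {
		return time.Since(lastRelist.Load().(time.Time)) < 2*time.Second
	}

	// Producer (analogue of GenericPLEG.relist): blocks once the channel is full.
	go func() {
		for i := 0; ; i++ {
			lastRelist.Store(time.Now())
			eventCh <- i // blocks forever after the buffer fills up
			time.Sleep(100 * time.Millisecond)
		}
	}()

	// Simulated runtime hiccup: the consumer stalls long enough for the
	// channel to fill and the producer to block on the send above.
	time.Sleep(3 * time.Second)

	// Consumer (analogue of syncLoop): it skips draining while the producer
	// looks unhealthy, so once the producer is blocked, nothing ever recovers.
	for {
		if !healthy() {
			fmt.Println("skipping pod synchronization - producer is not healthy")
			time.Sleep(time.Second)
			continue
		}
		fmt.Println("handled event", <-eventCh)
	}
}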
It is related to issue #45419.
What you expected to happen:
No such issue
How to reproduce it (as minimally and precisely as possible):
It is not easy to reproduce; it happens after a few days of stress testing (scaling a deployment repeatedly under high load).
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.11.5
- Cloud provider or hardware configuration: Bare metal
- OS (e.g. from /etc/os-release): CentOS 7.4
- Kernel (e.g. uname -a): Linux bjc-qtt-k8s-node-010 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kubeadm
- Others:
/kind bug
Activity
BSWANG commented on Jan 2, 2019
/sig node
denverdino changed the title from "Deadlock in PLEG relist() for heathy checking and Kubelet syncLoop()" to "Deadlock in PLEG relist() for health check and Kubelet syncLoop()" on Jan 8, 2019
changyaowei commented on Jan 8, 2019
When PLEG does relist, can we invoke the kl.runtimeState.runtimeErrors func to check whether the runtime is OK, and skip the relist if it is not?
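A rough sketch of that idea (hypothetical wiring: the real GenericPLEG does not hold a runtimeState reference, so the field below is invented purely for illustration):

// Hypothetical: skip the relist work while the runtime reports errors, so
// relist never tries to push events that nobody will consume.
func (g *GenericPLEG) relist() {
	if errs := g.runtimeState.runtimeErrors(); len(errs) != 0 {
		glog.Infof("skipping relist - runtime errors: %v", errs)
		return
	}
	// ... existing relist logic ...
}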
changyaowei commented on Jan 9, 2019
I submitted PR #72709.
Random-Liu commented on Jan 15, 2019
Thanks for finding the issue! The issue description makes sense to me.
PLEG is a core component, so we should try our best to keep it unblocked. The event channel becoming full looks more like an issue on the event consumer side, which shouldn't block PLEG.
Option 1
One solution is to queue the events in PLEG and send them later, but that doesn't fundamentally solve the problem: if we don't set a limit on the number of events we queue, it looks like a potential memory leak; if we set a limit, we may eventually hit this problem again.
Option 2
Another solution is to add a retry for sending events in PLEG, and set a bit in PLEG between retries indicating that PLEG is not blocked, just still trying to send out the event. In the Healthy function of PLEG, return healthy if the bit is set.
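A hedged sketch of option 2 (sendingEvent and sendRetryPeriod are made-up names, not actual PLEG fields): keep retrying the send with a timeout, and expose a flag so Healthy() can tell "waiting on a slow consumer" apart from "stuck".

// Illustrative only: assumes a new atomic.Bool field sendingEvent on
// GenericPLEG and a new sendRetryPeriod constant.
func (g *GenericPLEG) sendEvent(ev *PodLifecycleEvent) {
	for {
		select {
		case g.eventChannel <- ev:
			g.sendingEvent.Store(false)
			return
		case <-time.After(sendRetryPeriod):
			// We are alive but blocked on the consumer; record that so the
			// health check does not treat this as a stuck relist.
			g.sendingEvent.Store(true)
		}
	}
}

func (g *GenericPLEG) Healthy() (bool, error) {
	if g.sendingEvent.Load() {
		return true, nil // blocked on the event channel, but not stuck
	}
	// ... fall through to the usual "relist ran recently" check ...
	return true, nil
}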
Option 3
Actually, even if we fail to send the event and discard it, since we have already updated the pod cache, the periodic pod resync should eventually pick it up. This may delay pod sync for up to 1 min (the default sync frequency; see kubernetes/pkg/kubelet/apis/config/v1beta1/defaults.go, line 51 in 4fec22b).
We can make the channel send non-blocking and discard the event when the channel is full, given the channel capacity of 1000 (see kubernetes/pkg/kubelet/kubelet.go, line 151 in 4fec22b).
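For reference, a minimal sketch of what option 3 could look like in the relist() send loop (this is just the shape of the idea, not the actual change made in the PR):

// Illustrative only: non-blocking send; drop the event when the channel
// is full instead of blocking relist().
select {
case g.eventChannel <- events[i]:
default:
	// The pod cache was already updated above, so the periodic pod
	// resync (default 1 min) will eventually reconcile this pod.
	glog.Errorf("event channel is full, discarding event %v", events[i])
}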
Of the 3 options above, I'm fine with either 2 or 3. Option 2 doesn't have the 1 min pod sync delay, but is a bit ad hoc; option 3 is more generic, but may delay pod sync.
There might be other options.
/cc @yujuhong Any thoughts?
Random-Liu commented on Jan 15, 2019
If PLEG is already blocked by a full event channel, this doesn't help, right?
dchen1107 commented on Jan 15, 2019
cc/ @jiayingz This looks like the production issue you mentioned to me a while back.
derekwaynecarr commented on Jan 15, 2019
@Random-Liu of the options enumerated, option 3 is my preference.
yujuhong commented on Jan 15, 2019
Thanks for reporting the issue!
@denverdino, @resouer what was the load on the node that triggered this condition? If the node wasn't functioning at all and the events slowly built up over time, bumping the size of the channel won't help.
This should work and doesn't complicate the logic in PLEG further, so it's my preferred option too.
We need some unit tests and, ideally, a stress test, as mentioned in the sig-node meeting.
resouer commented on Jan 15, 2019
Agreed, choosing a proper size for the event channel is tricky, as the scale also grows over time. We will ship the fix as option 3 (#72709) for 1.14. Thanks for summarizing all of this in the sig meeting!
/assign
Random-Liu commented on Jan 15, 2019
@resouer If possible, could you share how heavy the stress was? It would help us understand the issue and kubelet stress handling better. :)
resouer commented on Jan 15, 2019
@Random-Liu @yujuhong Sure, our product team (@denverdino) will post the stress test data in follow-up comments after removing sensitive information.
BSWANG commented on Jan 16, 2019
@resouer @Random-Liu @yujuhong Thanks for the reply.
The following is our test case for reproducing this issue:
The test cluster has 5 nodes on Aliyun ECS (similar to EC2): three 4C8G machines as master nodes and two 96C384G machines as worker nodes.
First, we create 2 deployments running on the worker nodes. One deployment is HTTP server pods; the other is siege pods that generate network stress against the HTTP server deployment. These stress test deployments produce a load of 50 CPU cores on the worker nodes.
Then we create a job that creates/deletes 20 pods every 5 seconds.
After about 1 day of stress testing, the worker node becomes NotReady. Digging into the pprof goroutine stack trace, we found the channel blocked in relist.
Random-Liu commented on Jan 16, 2019
@BSWANG Yeah, I can see how frequent container state changes can cause this issue. The numbers are very useful, thanks a lot!
chestack commented on Jun 17, 2019
@denverdino could you please share how to dump the kubelet goroutine stack trace like the one you pasted in your comment? Thanks a lot.
denverdino commented on Jun 17, 2019
This issue is fixed by #72709
chestack commented on Jun 18, 2019
Using pprof is the answer to my question:
curl http://localhost:8001/api/v1/proxy/nodes/node-1/debug/pprof/goroutine?debug=2
@denverdino thank you anyway.