Description
Is this a request for help? No
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): PLEG NotReady kubelet
Is this a BUG REPORT or FEATURE REQUEST? Bug
Kubernetes version (use kubectl version): 1.6.2
Environment:
- Cloud provider or hardware configuration: CoreOS on AWS
- OS (e.g. from /etc/os-release): CoreOS 1353.7.0
- Kernel (e.g. uname -a): 4.9.24-coreos
- Install tools:
- Others:
What happened:
I have a 3-worker cluster. Two, and sometimes all three, nodes keep dropping into NotReady with the following messages in journalctl -u kubelet:
May 05 13:59:56 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 13:59:56.872880 2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 13:59:56 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 13:59:56.872908 2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 13:59:56.872865742 +0000 UTC LastTransitionTime:2017-05-05 13:59:56.872865742 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.629592089s ago; threshold is 3m0s}
May 05 14:07:57 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:07:57.598132 2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:07:57 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:07:57.598162 2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:07:57.598117026 +0000 UTC LastTransitionTime:2017-05-05 14:07:57.598117026 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.346983738s ago; threshold is 3m0s}
May 05 14:17:58 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:17:58.536101 2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:17:58 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:17:58.536134 2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:17:58.536086605 +0000 UTC LastTransitionTime:2017-05-05 14:17:58.536086605 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.275467289s ago; threshold is 3m0s}
May 05 14:29:59 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:29:59.648922 2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:29:59 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:29:59.648952 2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:29:59.648910669 +0000 UTC LastTransitionTime:2017-05-05 14:29:59.648910669 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.377520804s ago; threshold is 3m0s}
May 05 14:44:00 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:44:00.938266 2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:44:00 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:44:00.938297 2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:44:00.938251338 +0000 UTC LastTransitionTime:2017-05-05 14:44:00.938251338 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.654775919s ago; threshold is 3m0s}
docker daemon is fine (local docker ps, docker images, etc. all work and respond immediately).
Using weave networking, installed via kubectl apply -f https://git.io/weave-kube-1.6.
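For reference, a rough way to put numbers behind "respond immediately" on an affected node (a sketch, assuming the docker CLI is talking to the local daemon over its default socket):
# time the listing calls the kubelet's PLEG relies on
time docker ps -a
time docker images
# overall daemon state: storage driver, container counts, any warnings
docker info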
What you expected to happen:
Nodes to be ready.
How to reproduce it (as minimally and precisely as possible):
Wish I knew how!
Anything else we need to know:
All of the nodes (workers and masters) are on the same private subnet, with a NAT gateway to the Internet. Workers are in a security group that allows unrestricted access (all ports) from the masters' security group; masters allow all ports from the same subnet. kube-proxy is running on the workers; the apiserver, controller-manager, and scheduler run on the masters.
kubectl logs and kubectl exec always hang, even when run from the master itself (or from outside).
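Worth noting for anyone debugging the hangs: kubectl logs and kubectl exec are proxied by the apiserver to the kubelet on the target node (port 10250 by default), so a quick check from a master can separate a kubelet problem from a security-group or routing problem. A sketch with a placeholder worker IP and the default ports; any HTTP response at all (even 401/403) means the port is reachable, while a hang points at the network path:
# from a master, against the worker's kubelet API port
curl -sk --connect-timeout 5 https://10.50.20.208:10250/healthz
# the read-only port, if enabled, answers without auth
curl -s --connect-timeout 5 http://10.50.20.208:10255/pods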
Activity
yujuhong commented on May 5, 2017
@deitch, how many containers were running on the node? What's the overall cpu utilization of your nodes?
deitch commented on May 5, 2017
Basically none: kube-dns, weave-net, weave-npc, and 3 template sample services. Actually only one, because two had no image and were going to be cleaned up. The nodes are AWS m4.2xlarge, so it's not a resource issue.
I ended up having to destroy the nodes and recreate them. No PLEG messages since the destroy/recreate, and they seem 50% OK: they stay Ready, although they still refuse to allow kubectl exec or kubectl logs.
I really struggled to find any documentation on what PLEG really is, but more importantly on how to check its own logs and state and debug it.
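For example, is something along these lines the right place to look (just a guess, assuming a systemd-managed kubelet)?
# PLEG has no log of its own; whatever it reports comes out through the kubelet
journalctl -u kubelet --no-pager | grep -i pleg
# raising kubelet verbosity (e.g. --v=5) should also make the periodic
# relist activity visible in the same journal, at the cost of a lot of noise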
deitch commented on May 5, 2017
Hmm... to add to the mystery, no container can resolve any hostnames, and kubedns gives:
FWIW, 10.200.0.1 is the kube api service internally, 10.200.0.5 is DNS, and 10.50.20.0/24 and 10.50.21.0/24 are the subnets (2 separate AZs) on which the masters and workers run.
Is something just really fubar in the networking?
deitch commented on May 5, 2017
@bboreham could this be related to weave and not kube (or at least misconfigured weave)? Standard weave with the IPALLOC_RANGE=10.100.0.0/16 added, as discussed at weaveworks/weave#2736.
qiujian16 commented on May 11, 2017
@deitch PLEG is how the kubelet periodically lists the pods on the node to check their health and update its cache. If you see the PLEG timeout log, it may not be related to DNS; it more likely means the kubelet's calls to docker are timing out.
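One way to test that theory from an affected node is to time the same kind of listing the kubelet does, straight against the docker API (a sketch; it assumes the default /var/run/docker.sock, a curl built with unix-socket support, and at least one container on the node):
# list all containers via the API path the kubelet actually uses
time curl -s --unix-socket /var/run/docker.sock "http://localhost/containers/json?all=1" > /dev/null
# inspecting a container is the other call that tends to be slow when the daemon is struggling
time docker inspect $(docker ps -aq | head -n 1) > /dev/null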
deitch commented on May 11, 2017
Thanks @qiujian16. The issue appears to have gone away, but I have no idea how to check it. Docker itself appeared healthy. I was wondering if it could be the networking plugin, but that should not affect the kubelet itself.
Can you give me some pointers here on checking PLEG health and status? Then we can close this out until I see the issue recur.
qiujian16 commented on May 11, 2017
@deitch PLEG is short for "pod lifecycle event generator". It is an internal component of the kubelet, and I do not think you can directly check its status; see https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-lifecycle-event-generator.md
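One partial exception, at least on kubelets of this vintage: relist timing is exported on the kubelet metrics endpoint, so PLEG health can be watched indirectly. A sketch, assuming the read-only port is enabled on its default 10255 (the metric names below are the old microseconds ones and were renamed in later releases):
# on the node itself
curl -s http://localhost:10255/metrics | grep -i pleg
# look for kubelet_pleg_relist_latency_microseconds and kubelet_pleg_relist_interval_microseconds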
deitch commented on May 11, 2017
Is it an internal module in the kubelet binary? Is it another standalone container (docker, runc, containerd)? Or is it just a standalone binary?
Basically, if the kubelet reports PLEG errors, it is very helpful to be able to find out what those errors are, then check its status and try to replicate the problem.
qiujian16 commented on May 11, 2017
It is an internal module.
yujuhong commented on May 11, 2017
@deitch most likely docker was not as responsive at times, causing PLEG to miss its threshold.
bjhaid commented on May 11, 2017
I am having a similar issue on all nodes but one in a cluster I just created.
logs:
I have downgraded docker and also restarted virtually everything, to no avail. The nodes are all managed via puppet, so I expect them to be completely identical; I have no clue what is wrong. Docker logs in debug mode show it is getting these requests.
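For anyone who wants to reproduce that check, one way to turn on docker's debug logging (a sketch; the path assumes the default /etc/docker/daemon.json, and this overwrites an existing file, so merge by hand if you already have one):
# enable daemon debug logging, then restart and follow the daemon journal
echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
journalctl -u docker -f
# at debug level the kubelet's list/inspect requests show up as individual API calls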
deitch commented on May 11, 2017
@bjhaid what are you using for networking? I was seeing some interesting networking issues at the time.
bjhaid commented on May 11, 2017
@deitch weave, but I don't think this is a networking-related problem, since it seems to be a communication problem between the kubelet and docker. I can confirm docker is getting these requests from the kubelet, via docker's debug logging.
deitch commented on May 11, 2017
My PLEG issues appear to be gone, although I won't feel confident until the next time I set up these clusters afresh (all via terraform modules I built).
Weave issues appear to exist, or possibly k8s/docker ones.