Node flapping between Ready/NotReady with PLEG issues #45419
Comments
@deitch, how many containers were running on the node? What's the overall CPU utilization of your nodes? |
Basically none: kube-dns, weave-net, weave-npc, and 3 template sample services. Actually only one, because two had no image and were going to be cleaned up. AWS m4.2xlarge, so not a resource issue. I ended up having to destroy the nodes and recreate them. No PLEG messages since the destroy/recreate, and they seem 50% OK: they stay Ready. I really struggled to find any documentation on what PLEG really is, but more importantly how to check its own logs and state and debug it. |
Hmm... to add to the mystery, no container can resolve any hostnames, and kubedns gives:
FWIW, is something just really fubar in the networking? |
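A generic way to check in-cluster DNS from a throwaway pod (not specific to this setup; busybox:1.28 is pinned because later tags have a broken nslookup):

kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default
# If this times out or returns SERVFAIL, the problem is DNS/networking,
# not the kubelet itself.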
@bboreham could this be related to weave and not kube (or at least misconfigured weave)? Standard weave with the |
@deitch PLEG is what kubelet uses to periodically list pods on the node, to check health and update its cache. If you see a PLEG timeout log, it may not be related to DNS; more likely kubelet's call to docker is timing out. |
Thanks @qiujian16. The issue appears to have gone away, but I have no idea how to check it. Docker itself appeared healthy. I was wondering if it could be the networking plugin, but that should not affect the kubelet itself. Can you give me some pointers here on checking PLEG healthiness and status? Then we can close this out until I see the issue recur. |
@deitch PLEG is short for "pod lifecycle event generator"; it is an internal component of kubelet and I do not think you can directly check its status. See https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-lifecycle-event-generator.md |
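Since PLEG is internal to the kubelet, the closest thing to checking its status is watching the kubelet's own logs and the node conditions it reports, for example:

# PLEG complaints, if any, show up in the kubelet's own log
journalctl -u kubelet | grep -i pleg

# the same "PLEG is not healthy" text appears in the node's Ready condition
kubectl describe node <node-name> | grep -A4 Conditions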
Is it an internal module in the kubelet binary? Is it another standalone container (docker, runc, containerd)? Is it just a standalone binary? Basically, if kubelet reports PLEG errors, it is very helpful to be able to find out what those errors are, then check its status and try to replicate. |
it is an internal module |
@deitch most likely docker was not as responsive at times, causing PLEG to miss its threshold. |
I am having a similar issue on all nodes but one in a cluster I just created.
I have downgraded docker and also restarted virtually everything, to no avail. The nodes are all managed via puppet, so I expect them to be completely identical; I have no clue what is wrong. Docker logs in debug mode show it's getting these requests |
@bjhaid what are you using for networking? I was seeing some interesting networking issues at the time. |
@deitch weave, but I don't think this is a networking-related problem, since it seems to be a communication problem between kubelet and docker. I can confirm docker is getting these requests from kubelet via docker's debug logging |
My PLEG issues appear to be gone, although I won't feel confident until the next time I set up these clusters afresh (all via terraform modules I built). Weave issues appear to persist, or possibly k8s/docker ones. |
@deitch did you do anything to make the PLEG issues go away, or did it just magically happen? |
Actually it's hostname resolution: the controllers could not resolve the hostnames of the newly created nodes. Sorry for the noise |
I was too quick to report things being fine; the problem still exists. I'll keep looking and report back if I find anything |
I guess this issue is related to … So I have no clue why or how, but I'm pretty sure it's due to … |
Forgot to return here: my problem was due to the weave interface not coming up, so the containers didn't have networking. However, this was due to our firewall blocking weave's data and vxlan ports; once I opened those ports things were fine |
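For anyone hitting the same thing: Weave Net needs TCP 6783 and UDP 6783-6784 open between all nodes. A rough connectivity check from one node to a peer might be (replace <peer-node-ip>; UDP probes are best-effort):

nc -z -v -w2 <peer-node-ip> 6783        # TCP control channel
nc -z -u -v -w2 <peer-node-ip> 6783     # UDP sleeve data
nc -z -u -v -w2 <peer-node-ip> 6784     # UDP fastdp/vxlan data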
There were two sets of issues I had, possibly related.
Suspiciously, all of the issues with pleg happened at exactly the same time as the weave network issues. Bryan @ weaveworks pointed me to the coreos issues. CoreOS has a rather aggressive tendency to try to manage bridges, veths, basically everything. Once I disabled CoreOS from doing it except on … Are the people still having problems running coreos? |
We've been plagued by these issues for the last month or so (I want to say since upgrading the clusters to 1.6.x from 1.5.x) and it's just as mysterious. We're running weave on debian jessie AMIs in aws, and every once in a while a cluster will decide that PLEG is not healthy. Weave seems okay in this case, because pods come up fine up until a point; docker ps and docker info seem fine on the node. We cannot get our heads around this at all. I really wish the PLEG check were a little more verbose, and that we had some actual detailed documentation about what the beep it's doing, because there seems to be a HUGE number of issues open about it, no one really knows what it is, and for such a critical module I would love to be able to reproduce the checks that it sees as failing. |
I second the thoughts on PLEG's mysteriousness. On my end, though, after much work for my client, stabilizing coreos and its misbehaviour with networks helped a lot. |
The PLEG health check does very little. In every iteration, it calls … Unless your node is loaded with such a huge number of pods that PLEG can't finish doing all of this in 3 minutes (which should not happen), the most probable cause is that docker is slow. You may not observe that in your occasional … If we don't expose the "unhealthy" status, it'd hide many problems from the users and potentially cause more issues. For example, kubelet would silently not react to changes in a timely manner and cause even more confusion. Suggestions on how to make this more debuggable are welcome... |
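For what it's worth, the kubelet does export PLEG relist timings as Prometheus metrics, which is one place to look; metric names and ports vary by version (this assumes the 1.6-era read-only port):

curl -s http://localhost:10255/metrics | grep -i pleg
# 1.6-era kubelets expose kubelet_pleg_relist_latency_microseconds on port 10255;
# newer releases expose kubelet_pleg_relist_duration_seconds on the
# authenticated port 10250 instead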
Running into PLEG unhealthy warnings and flapping node health status: k8s 1.6.4 with weave. Only appears on a subset of (otherwise identical) nodes. |
Just a quick heads up: in our case the flapping workers and pods stuck in ContainerCreating were caused by the security groups of our EC2 instances not allowing weave traffic between master and workers and among the workers. Therefore the nodes could not properly come up and got stuck in NotReady. Kubernetes 1.6.4; with proper security groups it works now. |
I am experiencing something like this issue with this config... Kubernetes version (use kubectl version): 1.6.4 Environment: Since this is a single-node cluster, I don't think my version of this issue is related to security groups or firewalls. |
The issue with security groups would make sense if you're just starting up the cluster. But these issues we're seeing are on clusters that have been running for months, with security groups in place. |
I had something similar just happen to me running kubelet version 1.6.2 on GKE. One of our nodes got shifted into a NotReady state; the kubelet logs on that node had two complaints: one, that the PLEG status check failed, and two, interestingly, that the image listing operations failed. Some examples of the failed image function calls, which I'm assuming are calls to the docker daemon: … As this was happening I saw disk IO spike a lot, especially the read operations, from the ~50 KB/s mark to the ~8 MB/s mark. It corrected itself after about 30-45 minutes, but maybe an image GC sweep was causing the increased IO? As has been said, PLEG monitors the pods through the docker daemon; if that's doing a lot of operations at once, could the PLEG checks be queued? |
I'm seeing this problem in 1.6.4 and 1.6.6 (on GKE) with flapping NotReady as the result. Since this is the latest version available on GKE I'd love to have any fixes backported to the next 1.6 release. One interesting thing is that the time that PLEG was last seen active doesn't change and is always a huge number (perhaps it's at some limit of whatever type it's stored in).
|
That PR is in v219-65 already; if your systemd is above that version, you need to look at other aspects. |
I found that a trigger for this, for me, has been when the quay.io/kubernetes_incubator/nfs-provisioner:latest based ReadWriteMany provisioner that longhorn suggests dies. The nfs pod being gone upsets the kernel, so things like df hang, but so does docker inspect on the container that mounted the nfs volume. Now what kills the nfs pod... I have no idea. FWIW I'm using canal, not weave. |
IMPORTANT EDIT: After removing components of my cluster one by one, it seems that the issue comes from a version mismatch, maybe (OS or K8s version). After a downgrade to K8s 1.18.16 from 1.20.4, and also a downgrade from Ubuntu 20.04 to 18.04, the node flapping is gone and the cluster passes my load test (many StatefulSets with PVCs). The node flapping is reproducible with … In my case … Issue reported on the Rancher repo also: rancher/rancher#31793 OLD ANSWER: When I reach about 10 cron jobs (10 pods created each minute; see the sketch after this comment), after 5-6 min the PLEG error is raised (theoretically 60 containers, 120 with pause containers). My cluster is:
|
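A minimal sketch of that kind of cron-job churn (hypothetical job names; assumes kubectl 1.14+ for kubectl create cronjob — the original manifests were not preserved in this thread):

# create 10 cron jobs that each start a short-lived pod every minute
for i in $(seq 1 10); do
  kubectl create cronjob "pleg-test-$i" --image=busybox \
    --schedule='* * * * *' -- /bin/sh -c 'sleep 30'
done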
@lingwooc in Longhorn v1.1 we natively support RWX via a custom ganesha image with NFS soft mount, so you shouldn't have any hangs in the case where the NFS server is gone. |
Thanks @joshimoo, I was wrong and it took me a while to find the final source. Longhorn and Istio work perfectly; the issue seems to be deeper. |
Quoting myself for visibility (2019): this issue is 5 years old (2017) and a dozen versions have been released since, so I think this is not a good place anymore to discuss PLEG issues. The root cause may be completely different depending on your setup/environment. I just wonder if removing the Docker daemon from the equation helps, given the recent container-runtime-related changes. |
FYI we had this happen because of a containerd update from 1.4.3 to 1.4.4 (you can see what was used to compile docker with …) |
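Two ways to check which containerd build a node is actually running (availability depends on your install; recent docker releases print component versions):

docker version          # the Server section lists the containerd and runc versions
containerd --version    # if the binary is on the PATH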
Thank you @grosser! After your message, we deployed a test cluster (we are currently on K8s 1.19.8) with half the nodes on containerd 1.4.4 and the other half on containerd 1.4.3, and we managed to reproduce the issue by frequently changing the number of pods on specific nodes (scaling a simple deployment with a node selector up and down to overload targeted nodes). This caused the PLEG duration to go up to 10s quite quickly on the targeted nodes. And we confirm that the nodes with containerd 1.4.4 changed to "NotReady" after 10 minutes of PLEG duration alerts, whereas the nodes with containerd 1.4.3, even with the PLEG duration alerts, managed to stay Ready the whole time. If it is useful, here is the test we ran to reproduce the problem (see the sketch below):
|
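The exact commands from that test were not preserved here; a minimal sketch of that style of scale-up/scale-down churn (hypothetical deployment name) might look like:

# pin the pods to the node under test with a nodeSelector in the pod template
kubectl create deployment pleg-load --image=nginx
while true; do
  kubectl scale deployment pleg-load --replicas=50
  sleep 60
  kubectl scale deployment pleg-load --replicas=0
  sleep 60
done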
Hi all, be aware that the "runC" component (1.0.0-rc93) of "containerd.io", which is used by docker, will give you PLEG issues and node flapping between Ready and NotReady. I hope no one else will lose a ton of hours finding out the problem 🙂 Use another version of it, for example 1.0.0-rc92. You can also downgrade containerd.io to version 1.4.3-1; it contains a working version of runc. |
I'm not able to downgrade runC. Can you post a small how-to, please? |
What I did (before doing it you should stop docker):
- Check if it was installed via …; if yes, then remove it: …
- If not listed, then run: … It will show you where it is installed; then remove that folder: …
- Then I installed the specific version using a blog post: … Installation of the Go language library is required.
- Potentially it can also be installed via …
Maybe someone else can help and let us know if it is possible to see more versions under |
We are running some nodes on version 1.0.1-dev (nodes were created ~24 hours ago) and others on 1.0.2-dev (nodes were created ~6 hours ago). The former don't seem to have any issue, while the latter are experiencing the problems highlighted in this issue. I've installed the
How is this version mismatch even possible? |
If you are using ubuntu and the docker apt repository, you need to downgrade containerd.io to version 1.4.3-1 |
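For example, with Docker's apt repository (the version string matches the one used elsewhere in this thread; apt-mark hold is optional but keeps later upgrades from undoing the downgrade):

apt-get install --allow-downgrades -y containerd.io=1.4.3-1
apt-mark hold containerd.io   # prevent upgrades from pulling 1.4.4 back in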
Why does the "runC" component (1.0.0-rc93) of "containerd.io" cause node flapping between Ready and NotReady? |
There is a bug in the code which causes containers to be stuck in "Created" status, and commands like "docker inspect" on those stuck containers hang for a long time. PLEG uses this command to record the status of containers, and because of that very delayed response it crosses the timeout, which is 3 minutes, and the node becomes unhealthy. If you are looking for a code-level explanation, you should go to the git repo of runc. It is already fixed in the master branch; I am using version rc93+dev, i.e. the latest from the master branch, and the issue is not there anymore. |
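A quick way to look for the symptom described here (generic docker CLI calls, not specific to this cluster):

docker ps -a --filter status=created             # containers stuck in Created
time docker inspect <container-id> > /dev/null   # a healthy daemon answers in well under a second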
Okay so it seems that here it's "just" a version combination issue. Runc version:
So here it seems that 2 solutions exist:
|
We had runc 1.0.0-rc93 on Ubuntu 18.04 (first), then on Ubuntu 20.04, and it did cause issues. Downgrading via Docker's own apt repositories to … We initially had a strong suspicion that it was caused by CSI drivers, as we had recently upgraded those, so I disabled CSI entirely. When the cluster was running without CSI volumes (fewer containers, as some were in … So don't be discouraged from trying a downgrade of runc/containerd if this hits you. The issue will probably not occur for everyone (see @Moumouls's comment above, although he has the affected version running). |
A downgrade of runc/containerd fixed this issue on 3 clusters we maintain. All were deployed this year. If you're experiencing this issue, trying to downgrade is definitely worth it |
I am seeing a similar issue with runc version |
As a workaround suggested in #45419 (comment), downgrade containerd.io to the 1.4.3-1 version on all nodes. ATTENTION: this could cause restarts of potentially all pods in your cluster!
#!/bin/bash
set -euo pipefail

cd "$(dirname "$(realpath "$0")")"
FOLDER=$(pwd)
user='root'

# collect node IPs (column 7 of 'kubectl get nodes -o wide')
kubectl get nodes --no-headers -o wide | awk '{print $7}' > "$FOLDER/hosts.txt"
cat "$FOLDER/hosts.txt"

### exclude '#' lines, so hosts can be skipped by commenting them out
grep -v '#' "$FOLDER/hosts.txt" | while read -r host; do
  echo "$user@$host"
  ssh "$user@$host" -oStrictHostKeyChecking=no -t 'bash -s' <<EOF
echo '------------------'
runc --version
echo '>>>>> apt install containerd.io=1.4.3-1*'
apt-get install --allow-downgrades --allow-change-held-packages -y containerd.io=1.4.3-1*
runc --version
EOF
done |
An upgrade to containerd 1.4.6-1 can also fix this issue. |
I just experienced this issue with Kubernetes 1.22.0, running on Ubuntu 20.04.3, when using the Ubuntu-provided docker.io package (docker 20.10.7, containerd 1.5.2, runc 1.0.0~rc95). For me, it appears to have been fixed by switching to the official Docker repo (docker 20.10.8, containerd 1.4.9, runc 1.0.1). |
FYI: Here is a resource I found that summarizes this issue. It has the following shell script to check whether docker is slow and to show which containers are the culprits. For Docker:

TIMEFORMAT=%R; time docker ps --format "{{.ID}}\t{{.Names}}" | while read id name; do echo -e "\nChecking Container: $name : $id"; RESP=$(time docker inspect $id 2>&1 > /dev/null); echo -e "Took$RESP above secs for $name ID: $id \n"; done; echo -e "Total Time" |
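For nodes running containerd without docker, a rough equivalent (assuming crictl is installed and configured to talk to the containerd socket) could be:

# time an inspect of every running container; slow responses point at the runtime
for id in $(crictl ps -q); do
  echo "inspecting $id"
  time crictl inspect "$id" > /dev/null
done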
Is this a request for help? No
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): PLEG NotReady kubelet
Is this a BUG REPORT or FEATURE REQUEST? Bug
Kubernetes version (use kubectl version): 1.6.2

Environment:
Kernel (e.g. uname -a): 4.9.24-coreos

What happened:
I have a 3-worker cluster. Two and sometimes all three nodes keep dropping into NotReady with the following messages in journalctl -u kubelet:

The docker daemon is fine (local docker ps, docker images, etc. all work and respond immediately). Using weave networking installed via kubectl apply -f https://git.io/weave-kube-1.6

What you expected to happen:
Nodes to be ready.

How to reproduce it (as minimally and precisely as possible):
Wish I knew how!

Anything else we need to know:
All of the nodes (workers and masters) are on the same private subnet with a NAT gateway to the Internet. Workers are in a security group that allows unlimited access (all ports) from the masters' security group; masters allow all ports from the same subnet. proxy is running on workers; apiserver, controller-manager, and scheduler on masters. kubectl logs and kubectl exec always hang, even when run from the master itself (or from outside).