
Node flapping between Ready/NotReady with PLEG issues #45419

Closed
deitch opened this issue May 5, 2017 · 249 comments
Labels
area/reliability, kind/bug, sig/node

Comments

@deitch
Contributor

deitch commented May 5, 2017

Is this a request for help? No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.): PLEG NotReady kubelet


Is this a BUG REPORT or FEATURE REQUEST? Bug

Kubernetes version (use kubectl version): 1.6.2

Environment:

  • Cloud provider or hardware configuration: CoreOS on AWS
  • OS (e.g. from /etc/os-release):CoreOS 1353.7.0
  • Kernel (e.g. uname -a): 4.9.24-coreos
  • Install tools:
  • Others:

What happened:

I have a 3-worker cluster. Two, and sometimes all three, nodes keep dropping into NotReady with the following messages in journalctl -u kubelet:

May 05 13:59:56 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 13:59:56.872880    2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 13:59:56 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 13:59:56.872908    2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 13:59:56.872865742 +0000 UTC LastTransitionTime:2017-05-05 13:59:56.872865742 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.629592089s ago; threshold is 3m0s}
May 05 14:07:57 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:07:57.598132    2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:07:57 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:07:57.598162    2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:07:57.598117026 +0000 UTC LastTransitionTime:2017-05-05 14:07:57.598117026 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.346983738s ago; threshold is 3m0s}
May 05 14:17:58 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:17:58.536101    2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:17:58 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:17:58.536134    2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:17:58.536086605 +0000 UTC LastTransitionTime:2017-05-05 14:17:58.536086605 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.275467289s ago; threshold is 3m0s}
May 05 14:29:59 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:29:59.648922    2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:29:59 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:29:59.648952    2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:29:59.648910669 +0000 UTC LastTransitionTime:2017-05-05 14:29:59.648910669 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.377520804s ago; threshold is 3m0s}
May 05 14:44:00 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:44:00.938266    2858 kubelet_node_status.go:379] Recording NodeNotReady event message for node ip-10-50-20-208.ec2.internal
May 05 14:44:00 ip-10-50-20-208.ec2.internal kubelet[2858]: I0505 14:44:00.938297    2858 kubelet_node_status.go:682] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2017-05-05 14:44:00.938251338 +0000 UTC LastTransitionTime:2017-05-05 14:44:00.938251338 +0000 UTC Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m7.654775919s ago; threshold is 3m0s}

The docker daemon is fine (docker ps, docker images, etc. all work locally and respond immediately).

Networking is Weave, installed via kubectl apply -f https://git.io/weave-kube-1.6

What you expected to happen:

Nodes to be ready.

How to reproduce it (as minimally and precisely as possible):

Wish I knew how!

Anything else we need to know:

All of the nodes (workers and masters) are on the same private subnet with a NAT gateway to the Internet. Workers are in a security group that allows unlimited access (all ports) from the masters' security group; masters allow all ports from the same subnet. kube-proxy is running on the workers; apiserver, controller-manager and scheduler on the masters.

kubectl logs and kubectl exec always hang, even when run from the master itself (or from outside).
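
For what it's worth, the PLEG message the kubelet logs is also surfaced on the node object's Ready condition, so the flapping can be watched from the API side as well; a minimal sketch (the node name is the one from the logs above, substitute your own):

# watch the Ready condition flap
kubectl get nodes -w
# print the Ready condition's reason and message for one node
kubectl get node ip-10-50-20-208.ec2.internal \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}: {.status.conditions[?(@.type=="Ready")].message}{"\n"}'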

@yujuhong added the sig/node label May 5, 2017
@yujuhong
Contributor

yujuhong commented May 5, 2017

@deitch, how many containers were running on the node? What's the overall cpu utilization of your nodes?

@deitch
Contributor Author

deitch commented May 5, 2017

Basically none: kube-dns, weave-net, weave-npc, and 3 template sample services. Actually only one of those, because the other two had no image and were going to be cleaned up. The nodes are AWS m4.2xlarge, so it is not a resource issue.

I ended up having to destroy the nodes and recreate them. No PLEG messages since the destroy/recreate, and they seem 50% OK: they stay Ready, although they still refuse to allow kubectl exec or kubectl logs.

I really struggled to find any documentation on what PLEG actually is and, more importantly, on how to check its logs and state and debug it.

@deitch
Contributor Author

deitch commented May 5, 2017

Hmm... to add to the mystery, no container can resolve any hostnames, and kube-dns gives:

E0505 17:30:49.412272       1 reflector.go:199] pkg/dns/config/sync.go:114: Failed to list *api.ConfigMap: Get https://10.200.0.1:443/api/v1/namespaces/kube-system/configmaps?fieldSelector=metadata.name%3Dkube-dns&resourceVersion=0: dial tcp 10.200.0.1:443: getsockopt: no route to host
E0505 17:30:49.412285       1 reflector.go:199] pkg/dns/dns.go:148: Failed to list *api.Service: Get https://10.200.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.200.0.1:443: getsockopt: no route to host
E0505 17:30:49.412272       1 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://10.200.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.200.0.1:443: getsockopt: no route to host
I0505 17:30:51.855370       1 logs.go:41] skydns: failure to forward request "read udp 10.100.0.3:60364->10.50.0.2:53: i/o timeout"

FWIW, 10.200.0.1 is the kube api service internally, 10.200.0.5 is DNS, 10.50.20.0/24 and 10.50.21.0/24 are the subnets (2 separate AZs) on which masters and workers run.

Is something just really fubar in the networking?

@deitch
Contributor Author

deitch commented May 5, 2017

Is something just really fubar in the networking?

@bboreham could this be related to Weave and not kube (or at least to misconfigured Weave)? Standard Weave with IPALLOC_RANGE=10.100.0.0/16 added, as discussed at weaveworks/weave#2736.

@qiujian16
Contributor

@deitch PLEG is what the kubelet uses to periodically list the pods on the node, check their health, and update its cache. If you see the PLEG timeout log, it may not be related to DNS; more likely the kubelet's calls to docker are timing out.

@deitch
Contributor Author

deitch commented May 11, 2017

Thanks @qiujian16. The issue appears to have gone away, but I have no idea how to check it. Docker itself appeared healthy. I was wondering if it could be the networking plugin, but that should not affect the kubelet itself.

Can you give me some pointers here on checking pleg healthiness and status? Then we can close this out until I see the issue recur.

@qiujian16
Contributor

@deitch PLEG is short for "pod lifecycle event generator". It is an internal component of the kubelet and I do not think you can directly check its status; see https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-lifecycle-event-generator.md
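
One indirect way to watch it from the outside is the kubelet's metrics endpoint, which exposes PLEG relist timings. A hedged sketch, assuming the 1.6-era read-only kubelet port is enabled; the port, auth requirements and exact metric names vary by version (newer kubelets expose a histogram named kubelet_pleg_relist_duration_seconds on the authenticated 10250 port):

# run on the node itself; adjust the port and auth for your setup
curl -s http://localhost:10255/metrics | grep -i pleg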

@deitch
Contributor Author

deitch commented May 11, 2017

Is it an internal module in the kubelet binary? Is it another standalone component (docker, runc, containerd)? Or is it just a standalone binary?

Basically, if the kubelet reports PLEG errors, it would be very helpful to be able to find out what those errors are, check its status, and try to replicate them.

@qiujian16
Contributor

it is an internal module

@yujuhong
Contributor

@deitch most likely docker was not as responsive at times, causing PLEG to miss its threshold.

@bjhaid
Contributor

bjhaid commented May 11, 2017

I am having a similar issue on all nodes but one of a cluster I just created; logs:

May 11 19:00:59 kube-worker03.foo.bar.com kubelet[3213]: E0511 19:00:59.139374    3213 remote_runtime.go:109] StopPodSandbox "12c6a5c6833a190f531797ee26abe06297678820385b402371e196c69b67a136" from runtime service failed: rpc error: code = 4 desc = context deadline exceeded
May 11 19:00:59 kube-worker03.foo.bar.com kubelet[3213]: E0511 19:00:59.139401    3213 kuberuntime_gc.go:138] Failed to stop sandbox "12c6a5c6833a190f531797ee26abe06297678820385b402371e196c69b67a136" before removing: rpc error: code = 4 desc = context deadline exceeded
May 11 19:01:04 kube-worker03.foo.bar.com kubelet[3213]: E0511 19:01:04.627954    3213 pod_workers.go:182] Error syncing pod 1c43d9b6-3672-11e7-a6da-00163e041106 ("kube-dns-4240821577-1wswn_kube-system(1c43d9b6-3672-11e7-a6da-00163e041106)"), skipping: rpc error: code = 4 desc = context deadline exceeded
May 11 19:01:18 kube-worker03.foo.bar.com kubelet[3213]: E0511 19:01:18.627819    3213 pod_workers.go:182] Error syncing pod 1c43d9b6-3672-11e7-a6da-00163e041106 ("kube-dns-4240821577-1wswn_kube-system(1c43d9b6-3672-11e7-a6da-00163e041106)"), skipping: rpc error: code = 4 desc = context deadline exceeded
May 11 19:01:21 kube-worker03.foo.bar.com kubelet[3213]: I0511 19:01:21.627670    3213 kubelet.go:1752] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.339074625s ago; threshold is 3m0s]

I have downgraded docker and restarted virtually everything, to no avail. The nodes are all managed via Puppet, so I expect them to be completely identical; I have no clue what is wrong. Docker logs in debug mode show it is receiving these requests.
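
For anyone who wants to run the same check, a minimal sketch of turning on daemon debug logging; note that this overwrites any existing /etc/docker/daemon.json (merge by hand if you already have one) and that restarting docker will bounce the containers on the node:

echo '{ "debug": true }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
journalctl -u docker -f     # the debug-level log shows each API request the kubelet sends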

@deitch
Contributor Author

deitch commented May 11, 2017

@bjhaid what are you using for networking? I was seeing some interesting networking issues at the time.

@bjhaid
Contributor

bjhaid commented May 11, 2017

@deitch Weave, but I don't think this is a networking-related problem, since it seems to be a communication problem between the kubelet and docker. I can confirm via docker's debug logging that docker is receiving these requests from the kubelet.

@deitch
Contributor Author

deitch commented May 11, 2017

My PLEG issues appear to be gone, although I won't feel confident until the next time I set up these clusters afresh (all via Terraform modules I built).

Weave issues appear to exist, or possibly k8s/docker.

@bjhaid
Contributor

bjhaid commented May 11, 2017

@deitch did you do anything to make the PLEG issues go away, or did they just magically disappear?

@bjhaid
Contributor

bjhaid commented May 11, 2017

Actually it's hostname resolution: the controllers could not resolve the hostnames of the newly created nodes. Sorry for the noise.

@bjhaid
Contributor

bjhaid commented May 11, 2017

I was too quick to report things being fine; the problem still exists. I'll keep looking and report back if I find anything.

@gbergere

I guess this issue is related to weave-kube. I had the same issue, and this time, in order to solve it without recreating the cluster, I had to remove Weave and re-apply it (with a reboot of the node to propagate the removal)... And it's back.

So I have no clue why or how, but I'm pretty sure it's due to weave-kube-1.6.

@bjhaid
Contributor

bjhaid commented May 19, 2017

Forgot to return here: my problem was due to the Weave interface not coming up, so the containers didn't have networking. That in turn was caused by our firewall blocking the Weave data and VXLAN ports; once I opened those ports, things were fine.

@deitch
Contributor Author

deitch commented May 19, 2017

There were two sets of issues I had, possibly related.

  1. PLEG issues. I believe they have gone away, but I have not recreated enough clusters to be completely confident. I do not believe I changed much (i.e. anything) directly to make that happen.
  2. Weave issues wherein containers were unable to connect to anything.

Suspiciously, all of the issues with PLEG happened at exactly the same time as the Weave network issues.

Bryan from Weaveworks pointed me to the CoreOS issues. CoreOS has a rather aggressive tendency to try to manage bridges, veths, basically everything. Once I stopped CoreOS from doing that except on lo and the actual physical interfaces on the host, all of my problems went away.

Are the people still seeing problems running CoreOS?

@hollowimage

We've been plagued by these issues for the last month or so (I want to say since upgrading the clusters from 1.5.x to 1.6.x) and it's just as mysterious.

We're running Weave on Debian Jessie AMIs in AWS, and every once in a while a cluster will decide that PLEG is not healthy.

Weave seems okay in this case, because pods come up fine up until a point.
One thing we noted is that if we scale ALL our replicas down, the issue seems to go away; but as we start scaling deployments and statefulsets back up, around a certain number of containers this happens (at least this time).

docker ps and docker info seem fine on the node.
Resource utilization is nominal: 5% CPU, 1.5/8 GB of RAM used (according to htop as root); total node resource provisioning sits around 30% with everything that's supposed to be scheduled on it, scheduled.

We cannot get our head around this at all.

I really wish the PLEG check were a little more verbose, and that we had some actual detailed documentation about what the beep it's doing, because there seem to be a HUGE number of issues open about it, no one really knows what it is, and for such a critical module I would love to be able to reproduce the checks that it sees as failing.
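
In the meantime, about the closest thing to a verbose view is the kubelet's own log; a minimal sketch of pulling the PLEG-related lines on a node (assumes systemd/journald, as in the original report):

journalctl -u kubelet --since "2 hours ago" | grep -iE 'pleg|NotReady'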

@deitch
Contributor Author

deitch commented May 26, 2017

I second the thoughts on PLEG mysteriousness. On my end though, after much work for my client, stabilizing CoreOS and its misbehaviour with networks helped a lot.

@yujuhong
Contributor

The PLEG health check does very little. In every iteration, it calls docker ps to detect container state changes, and then calls docker ps and docker inspect to get the details of those containers.
After finishing each iteration, it updates a timestamp. If the timestamp hasn't been updated for a while (i.e., 3 minutes), the health check fails.

Unless your node is loaded with such a huge number of pods that PLEG can't finish all of this within 3 minutes (which should not happen), the most probable cause is that docker is slow. You may not observe that in your occasional docker ps checks, but that doesn't mean the slowness isn't there.
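
One rough way to catch that intermittent slowness (not the kubelet's actual code path, just an approximation of the calls PLEG depends on) is to time docker ps -a in a loop on the node and look for multi-second outliers:

# assumes GNU date for the nanosecond timestamps
while true; do
  start=$(date +%s%N)
  docker ps -a --no-trunc > /dev/null
  end=$(date +%s%N)
  echo "$(date -Is) docker ps -a took $(( (end - start) / 1000000 )) ms"
  sleep 5
done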

If we didn't expose the "unhealthy" status, it would hide many problems from users and potentially cause more issues. For example, the kubelet would silently stop reacting to changes in a timely manner and cause even more confusion.

Suggestions on how to make this more debuggable are welcome...

@anurag

anurag commented May 27, 2017

Running into PLEG unhealthy warnings and flapping node health status: k8s 1.6.4 with Weave. It only appears on a subset of (otherwise identical) nodes.

@agabert

agabert commented Jun 1, 2017

Just a quick heads-up: in our case the flapping workers and pods stuck in ContainerCreating were caused by the security groups of our EC2 instances not allowing Weave traffic between the master and workers and among the workers. Therefore the nodes could not properly come up and got stuck in NotReady.

Kubernetes 1.6.4

With proper security groups it works now.

@wirehead

wirehead commented Jun 1, 2017

I am experiencing something like this issue with this config...

Kubernetes version (use kubectl version): 1.6.4

Environment:
Cloud provider or hardware configuration: single System76 server
OS (e.g. from /etc/os-release): Ubuntu 16.04.2 LTS
Kernel (e.g. uname -a): Linux system76-server 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Install tools: kubeadm + weave.works

Since this is a single-node cluster, I don't think my version of this issue is related to security groups or firewalls.

@hollowimage

The issue with security groups would make sense if you're just starting up the cluster. But these issues we're seeing are on clusters that have been running for months, with security groups in place.

@zoltrain

zoltrain commented Jun 2, 2017

I had something similar just happen to me running kubelet version 1.6.2 on GKE.

One of our nodes shifted into a NotReady state. The kubelet logs on that node had two complaints: one, that the PLEG status check failed, and two, interestingly, that image listing operations failed.

Some examples of the image function calls that failed:
image_gc_manager.go:176
kuberuntime_image.go:106
remote_image.go:61

Which I'm assuming are calls to the docker daemon.

As this was happening I saw disk I/O spike a lot, especially read operations: from the ~50 KB/s mark to the ~8 MB/s mark.

It corrected itself after about 30-45 minutes, but maybe it was an image GC sweep causing the increased I/O?

As has been said, PLEG monitors the pods through the docker daemon; if the daemon is doing a lot of operations at once, could the PLEG checks be getting queued?
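
One way to test that theory is to watch disk utilisation while timing the image-listing call in a loop; a sketch, assuming the sysstat package is installed for iostat:

iostat -dx 5 &                 # watch %util and await while the test runs
TIMEFORMAT='docker images took %R s'
while true; do
  time docker images > /dev/null
  sleep 10
done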

@bergman

bergman commented Jun 26, 2017

I'm seeing this problem in 1.6.4 and 1.6.6 (on GKE) with flapping NotReady as the result. Since this is the latest version available on GKE I'd love to have any fixes backported to the next 1.6 release.

One interesting thing is that the time PLEG was last seen active doesn't change and is always a huge number (it appears to be at the limit of whatever type it's stored in: 2562047h47m16.854775807s is the maximum of a signed 64-bit nanosecond duration).

[container runtime is down PLEG is not healthy: pleg was last seen active 2562047h47m16.854775807s ago; threshold is 3m0s]

@danielzhanghl

The problem can also be caused by an old version of systemd; try upgrading systemd.

Refs:
https://my.oschina.net/yunqi/blog/3041189 (Chinese only)
lnykryn/systemd-rhel#322

That PR is already in v219-65; if your systemd is newer than that, look for another cause.
redhat-plumbers/systemd-rhel7@ac46d01

@lingwooc

lingwooc commented Mar 3, 2021

I found that a trigger for this, for me, has been when the quay.io/kubernetes_incubator/nfs-provisioner:latest based ReadWriteMany provisioner that Longhorn suggests dies. The NFS pod being gone upsets the kernel, so things like df hang, but so does docker inspect on the container that mounted the NFS volume. Now, what kills the NFS pod... I have no idea. FWIW I'm using Canal, not Weave.

@Moumouls

Moumouls commented Mar 20, 2021

IMPORTANT EDIT: After removing components of my cluster one by one, it seems the issue may come from a version mismatch (OS or K8s version). After a downgrade from K8s 1.20.4 to 1.18.16 and from Ubuntu 20.04 to 18.04, the node flapping is gone and the cluster passes my load test (many StatefulSets with PVCs). The node flapping is reproducible with a Rancher RKE cluster (1.20.4 / Ubuntu 20.04 / 3 nodes) and many StatefulSets (20+). Also, I'm not sure, but CronJobs seem to trigger the PLEG error.

In my case docker rm -f <container hanging in docker inspect> does not work; the solution I found is to delete the namespace, restart docker, or reboot the node.

Issue reported on Rancher repo also: rancher/rancher#31793

OLD ANSWER:
On my side I have many apps with a cron job every minute (* * * * *) with a 60s active deadline.

When I reach about 10 cron jobs (10 pods created each minute), after 5-6 minutes the PLEG error is raised (theoretically 60 containers, 120 counting pause containers).
After some investigation with docker ps -a | tr -s " " | cut -d " " -f1 | xargs -Iarg sh -c 'echo arg; docker inspect arg > /dev/null' I found that some "pause" containers (in my case rancher/pause containers) hang on docker inspect; a timeout-guarded variant of this scan is sketched below, after the cluster details.

My cluster is:

  • Ubuntu 20
  • Docker 20.10.5
  • RKE
  • Kubernetes : v1.20.4-rancher1-1
  • Nodes: 3 VPS, 24 Cores, 96GB RAM,
  • Core charts installed: Istio, Longhorn
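
A variant of that scan which will not itself get stuck on a hung container (a sketch; the 10-second timeout is an arbitrary choice):

for id in $(docker ps -aq); do
  printf '%s ' "$id"
  if timeout 10 docker inspect "$id" > /dev/null 2>&1; then
    echo "ok"
  else
    echo "slow or hung (docker inspect did not return within 10s)"
  fi
done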

@joshimoo

@lingwooc in Longhorn v1.1 we natively support rwx via a custom ganesha image, with NFS soft mount, so you shouldn't have any hangs in the case where the nfs server is gone.

@Moumouls

@lingwooc in Longhorn v1.1 we natively support rwx via a custom ganesha image, with NFS soft mount, so you shouldn't have any hangs in the case where the nfs server is gone.

Thanks @joshimoo, I was wrong, and it took me a while to find the real source. Longhorn and Istio work perfectly; the issue seems to be deeper.

@MaesterZ

MaesterZ commented Mar 23, 2021

From personal experience since 1.6 in production, PLEG issues usually show up when a node is drowning:
* Load is sky-high, a process is looping/consuming all the CPU resources
* Disk I/O is maxed out (logging?)
* Global overload (CPU+disk+network) => the CPU is interrupted all the time
Result => the Docker daemon is not responsive

Quoting myself for visibility (2019): this issue is four years old (2017) and a dozen versions have been released since, so I think this is no longer a good place to discuss PLEG issues. The root cause may be completely different depending on your setup/environment.

I just wonder whether removing the Docker daemon from the equation helps, given the recent container-runtime changes.

@grosser

grosser commented Mar 26, 2021

FYI, we had this happen because of a containerd update from 1.4.3 to 1.4.4 (you can see what docker was built with via docker version)... still not sure why.
See containerd/containerd#5274

@yogeek

yogeek commented Mar 30, 2021

Thank you @grosser !

After your message, we deployed a test cluster (we are currently on K8s 1.19.8) with half the nodes on containerd 1.4.4 and the other half on containerd 1.4.3, and we managed to reproduce the issue by frequently changing the number of pods on specific nodes (scaling a simple deployment up and down with a node selector to overload the targeted nodes). This caused the PLEG duration to go up to 10s quite quickly on the targeted nodes.

And we confirm that the nodes with containerd 1.4.4 changed to "NotReady" after 10 minutes of PLEG duration alerts, whereas the nodes with containerd 1.4.3, even with the PLEG duration alerts, managed to stay Ready the whole time.

If it is useful, here is the test we used to reproduce the problem:

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-enforced-az
spec:
  selector:
    matchLabels:
      app: nginx-enforced-az
  template:
    metadata:
      labels:
        app: nginx-enforced-az
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: eu-central-1a
      containers:
      - name: nginx
        # to avoid docker rate limiting, we used another registry than dockerhub
        image: quay.io/yobasystems/alpine-nginx:x86_64
        resources:
          requests:
            memory: "32Mi"
            cpu: "50m"
          limits:
            memory: "100Mi"
            cpu: "100m"
        ports:
        - containerPort: 8080
EOF

while true; do
  nb_pods_up=$(( RANDOM % 200 ))
  echo "Scaling to $nb_pods_up..."
  kubectl scale deploy nginx-enforced-az --replicas $nb_pods_up
  sleep 20
  nb_pods_down=$(( RANDOM % 50 ))
  echo "Scaling to $nb_pods_down..."
  kubectl scale deploy nginx-enforced-az --replicas $nb_pods_down
  sleep 20
done


@bbroniewski

bbroniewski commented Apr 14, 2021

Hi all, be aware that the runc component (1.0.0-rc93) of the containerd.io package, which is used by docker, will give you PLEG issues and node flapping between Ready and NotReady. I hope no one else will lose a ton of hours finding this out 🙂 Use another version of it, for example 1.0.0-rc92. You can also downgrade containerd.io to version 1.4.3-1, which contains a working version of runc.

@jonathanheilmann

Hi all, be aware that the runc component (1.0.0-rc93) of the containerd.io package, which is used by docker, will give you PLEG issues and node flapping between Ready and NotReady. I hope no one else will lose a ton of hours finding this out 🙂 Use another version of it, for example 1.0.0-rc92.

I'm not able to downgrade runc. Can you post a small how-to, please?

@bbroniewski

bbroniewski commented Apr 20, 2021

Hi all, be aware that the runc component (1.0.0-rc93) of the containerd.io package, which is used by docker, will give you PLEG issues and node flapping between Ready and NotReady. I hope no one else will lose a ton of hours finding this out 🙂 Use another version of it, for example 1.0.0-rc92.

I'm not able to downgrade runc. Can you post a small how-to, please?

What I did:

Before doing it, you should stop docker.

Check if it was installed via apt (should appear on the list):
apt list --installed

If yes, then remove it:
sudo apt-get purge runc

If not listed, then run:
which runc

It will show you where the runc binary is installed; then remove that binary:
rm <path_to_runc>

Then I installed a specific version following this blog post:
https://dev.bitolog.com/upgrade-runc-on-ubuntu/

Installing the Go language toolchain is required.

Potentially it can also be installed via apt-get, but I only see two versions under apt:

  • command: apt-cache madison runc
  • my output:
      runc | 1.0.0~rc93-0ubuntu1~20.04.1 | http://pl.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
      runc | 1.0.0~rc10-0ubuntu1 | http://pl.archive.ubuntu.com/ubuntu focal/main amd64 Packages

Maybe someone else can let us know whether it is possible to see more versions under apt, or whether they were simply never distributed and cannot easily be installed using apt-get.

@renan
Contributor

renan commented Apr 20, 2021

We are running some nodes on version 1.0.1-dev (nodes were created ~24 hours ago) and others on 1.0.2-dev (nodes were created ~6 hours ago).

The former don't seem to have any issues, while the latter are experiencing the problems highlighted in this issue.

I've installed the docker.io package which then installs containerd and runc (1.0.0~rc93-0ubuntu1~20.04.1). Running on Ubuntu 20.04.2 LTS.

root@ip-10-203-0-12:~# dpkg -l | grep runc
ii  runc                              1.0.0~rc93-0ubuntu1~20.04.1       amd64        Open Container Project - runtime

root@ip-10-203-0-12:~# runc --version
runc version spec: 1.0.2-dev
go: go1.13.8
libseccomp: 2.4.3

How is this version mismatch even possible?

@giskou

giskou commented Apr 20, 2021

If you are using Ubuntu and the Docker apt repository, you need to downgrade containerd.io to version 1.4.3-1.
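
For reference, a hedged sketch of doing that downgrade with apt (the version string is the one used elsewhere in this thread; restarting docker can bounce every container on the node, so drain it first if you can):

apt-get update
apt-get install -y --allow-downgrades --allow-change-held-packages 'containerd.io=1.4.3-1*'
apt-mark hold containerd.io     # stop apt from pulling the broken version back in
systemctl restart docker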

@wolfleave

Hi all, be aware that the runc component (1.0.0-rc93) of the containerd.io package, which is used by docker, will give you PLEG issues and node flapping between Ready and NotReady. I hope no one else will lose a ton of hours finding this out 🙂 Use another version of it, for example 1.0.0-rc92.

Why does the runc component (1.0.0-rc93) of containerd.io cause nodes to flap between Ready and NotReady?

@bbroniewski

bbroniewski commented Apr 21, 2021

Hi all, be aware that the runc component (1.0.0-rc93) of the containerd.io package, which is used by docker, will give you PLEG issues and node flapping between Ready and NotReady. I hope no one else will lose a ton of hours finding this out 🙂 Use another version of it, for example 1.0.0-rc92.

Why does the runc component (1.0.0-rc93) of containerd.io cause nodes to flap between Ready and NotReady?

There is a bug in the code that causes containers to get stuck in "Created" status, and commands like docker inspect on those stuck containers hang for a long time. PLEG uses that call to record container status, and because of the very delayed response it crosses the timeout, which is 3 minutes, and the node becomes unhealthy. If you are looking for a code-level explanation, go to the runc git repo. It is already fixed in the master branch; I am using rc93+dev, i.e. the latest from master, and the issue is not there anymore.
Issue in the runc repo: opencontainers/runc#2828
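
If you want to check whether a node is on the affected combination and showing that symptom, a quick sketch:

docker version                           # recent releases list the bundled containerd/runc builds in the Server section
runc --version                           # prints the runc build plus the OCI spec version it implements
docker ps -a --filter status=created     # containers stuck in "Created" are the symptom described above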

@Moumouls

Okay, so it seems that here it's "just" a version-combination issue.
On our production system we have many Ubuntu 18.04 nodes with runc 1.0.0-rc93, and everything works fine.

Runc version:

runc version 1.0.0-rc93
commit: 12644e614e25b05da6fd08a38ffa0cfe1903fdec
spec: 1.0.2-dev
go: go1.13.15
libseccomp: 2.4.3

So it seems that two solutions exist:

  • Downgrade/upgrade the runc version and avoid the rc93 release
  • Use the older Ubuntu 18.04 with runc rc93

@pehlert

pehlert commented Apr 26, 2021

We had runc 1.0.0-rc93 on Ubuntu 18.04 (first), then on Ubuntu 20.04, and it did cause issues. Downgrading via Docker's own apt repositories to containerd=1.4.3-1*, which comes with runc rc92, apparently solved it for us.

We initially had a strong suspicion that it was caused by CSI drivers, as we had recently upgraded those, so I disabled CSI entirely. When the cluster was running without CSI volumes (fewer containers, as some were in ContainerCreating state waiting for their volumes), the issue did not appear.

So don't be discouraged from trying a downgrade of runc/containerd if this hits you. The issue will probably not occur for everyone (see @Moumouls's comment above, although he has the affected version running).

@superbiche

A downgrade of runc/containerd fixed this issue on three clusters we maintain, all deployed this year. If you're experiencing this issue, trying a downgrade is definitely worth it.

@nmajin

nmajin commented May 15, 2021

I am seeing a similar issue with runc version spec: 1.0.1-dev. I'm not sure whether it is completely related, but it appears the Twistlock DaemonSet could also be causing this issue. We are still digging; I'm just curious whether any of this is related to the node Ready/NotReady issue I am seeing.

@toschneck

As suggested as a workaround in #45419 (comment), downgrading containerd.io to the 1.4.3-1* version, which holds runc at 1.0.0-rc92, fixed the issue. This small automated script, which SSHes into all nodes, fixed it for now:

ATTENTION: this could cause restarts of potentially all pods in your cluster!

#!/bin/bash
cd $(dirname $(realpath $0))
FOLDER=$(pwd)
user='root'

set -euo pipefail
kubectl get nodes --no-headers -o wide | awk '{print $7}' > $FOLDER/hosts.txt
cat $FOLDER/hosts.txt

### exclude '#' lines
grep -v '#' $FOLDER/hosts.txt| while read -r host; do
  echo "$user@$host"
  ssh "$user@$host" -oStrictHostKeyChecking=no -t 'bash -s' <<EOF
echo '------------------'
runc -version
echo '>>>>> apt install containerd.io=1.4.3-1*'
apt-get install --allow-downgrades --allow-change-held-packages -y containerd.io=1.4.3-1*
runc -version
EOF
done

@avestuk

avestuk commented Jun 10, 2021

An upgrade to containerd 1.4.6-1 can also fix this issue.

@bschofield

I just experienced this issue with Kubernetes 1.22.0, running on Ubuntu 20.04.3, when using the Ubuntu-provided docker.io package (docker 20.10.7, containerd 1.5.2, runc 1.0.0~rc95).

For me, it appears to have been fixed by switching to the official Docker repo (docker 20.10.8, containerd 1.4.9, runc 1.0.1).

@chinmaya-n

chinmaya-n commented Nov 29, 2022

FYI: here is a resource I found that summarizes this issue. It includes the following shell script to check whether docker is slow and to show which containers are the culprits.

For Docker

TIMEFORMAT=%R; time docker ps --format "{{.ID}}\t{{.Names}}" | while read id name; do echo -e "\nChecking Container: $name : $id"; RESP=$(time docker inspect $id 2>&1  > /dev/null); echo -e "Took$RESP above secs for $name ID: $id \n"; done; echo -e "Total Time"
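
For nodes running containerd or CRI-O without the Docker shim, a roughly equivalent check can be done with crictl (a sketch; crictl must be pointed at your runtime's CRI socket):

TIMEFORMAT=%R
for id in $(crictl ps -aq); do
  echo "Checking container: $id"
  time crictl inspect "$id" > /dev/null
done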
