PLEG is not healthy #61117

Closed
albertvaka opened this issue Mar 13, 2018 · 73 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@albertvaka

albertvaka commented Mar 13, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

I had a node misbehaving. The kubelet journal showed this:

Mar 13 16:02:43 vk-prod4-node-v18-w3fz kubelet[1450]: I0313 16:02:43.934473    1450 kubelet.go:1778] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m5.30015447s ago; threshold is 3m0s]
Mar 13 16:02:48 vk-prod4-node-v18-w3fz kubelet[1450]: I0313 16:02:48.934773    1450 kubelet.go:1778] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m10.30056416s ago; threshold is 3m0s]
Mar 13 16:02:53 vk-prod4-node-v18-w3fz kubelet[1450]: I0313 16:02:53.935030    1450 kubelet.go:1778] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m15.300823802s ago; threshold is 3m0s]
Mar 13 16:02:58 vk-prod4-node-v18-w3fz kubelet[1450]: I0313 16:02:58.935306    1450 kubelet.go:1778] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m20.301094771s ago; threshold is 3m0s]
Mar 13 16:03:03 vk-prod4-node-v18-w3fz kubelet[1450]: I0313 16:03:03.940675    1450 kubelet.go:1778] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m25.306459083s ago; threshold is 3m0s]
Mar 13 16:03:08 vk-prod4-node-v18-w3fz kubelet[1450]: I0313 16:03:08.940998    1450 kubelet.go:1778] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m30.306781014s ago; threshold is 3m0s]
Mar 13 16:03:13 vk-prod4-node-v18-w3fz kubelet[1450]: I0313 16:03:13.941284    1450 kubelet.go:1778] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m35.307062566s ago; threshold is 3m0s]
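As a sanity check (assuming a systemd-managed kubelet; substitute your node name), the same information can be pulled on demand rather than waiting for the node to flap:

journalctl -u kubelet --since "-15 min" | grep -i pleg
kubectl describe node <node-name> | grep -A 8 "Conditions:"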

Environment:

  • Kubernetes version (use kubectl version): 1.8.8
  • Cloud provider or hardware configuration: GCE
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.3 LTS
  • Kernel (e.g. uname -a): 4.13.0-1011-gcp
  • Install tools: kubeadm
  • Others:
    • Docker Version: 1.12.6-cs13
    • I'm using calico for networking
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Mar 13, 2018
@albertvaka
Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 13, 2018
@albertvaka albertvaka changed the title Kubelet invalid memory address or nil pointer dereference PLEG is not healthy Mar 13, 2018
@andrewrynhard
Contributor

@albertvaka Was this bug fixed, or why was the issue closed? I am seeing the same thing.

@oivindoh

oivindoh commented May 8, 2018

also just had this wonderful experience (with calico 2.6) on one of my nodes in Azure

@albertvaka probably abandoned it in favour of #45419

@whereisaaron

@albertvaka Out of memory on a node can cause this, and when it happens the node can't update its own node status to say so. You might find the hyperkube process has leaked and is using multiple GB of RAM. This might not be your problem, but it's one thing to check.
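A rough way to check this on the node (the process may be called hyperkube or kubelet depending on how the node was provisioned):

# top resident-memory consumers on the node
ps -eo rss,comm --sort=-rss | head -n 10
# or just the kubelet process itself
ps -C kubelet -o pid,rss,vsz,cmd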

@adampl

adampl commented Aug 28, 2018

In my case (Kubernetes 1.11.1) the problem is not caused by a lack of free memory.

@sfgroups-k8s

I am getting the same error message on a CentOS 7 node; it has a lot of free memory.

Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m2.007756043s ago; threshold is 3m0s}

@ArjunSureshKumar92

ArjunSureshKumar92 commented Dec 20, 2018

@grailsweb Did you find a solution? Having the same issue on AKS v1.11.

@ghost

ghost commented Jan 7, 2019

Kubelet 1.12.3.

I have hit this issue too; it's not a memory problem. I'll try to report it on an open issue somewhere.

@maheshmadpathi

Has this been fixed? We have the same issue with v1.12.0. Node memory isn't a problem.

@javafoot

I have the same issue on K8S v1.10.11 with healthy CPU, memory and disk.
v1.10.11 CentOS Linux 7 (Core) 3.10.0-862.14.4.el7.x86_64 docker://1.13.1
Docker works normally by running "docker ps or info".

The workaround below fixed the issue temporarily.

After restarting the docker daemon, there were no containers running, and meanwhile the kubelet logs kept printing:
kubelet[788387]: W0412 00:46:38.054829 788387 pod_container_deletor.go:77] Container "621a89ecc8d299773098d740cf9057602df1f67aba6ba85b7cae88701a9b4b06" not found in pod's containers
kubelet[567286]: I0411 22:44:11.526244 567286 kubelet.go:1803] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 2h8m28.080201738s ago; threshold is 3m0s]

That gave me an idea: delete "/var/lib/kubelet/pods/*".
/var/lib/kubelet/pods/9a954722-5c0c-11e9-91fc-005056bd5f06:
containers etc-hosts plugins volumes

Step1. Stop kubelet
Step2. Remove pods, but the volume folders can't be removed at all:
rm -rf /var/lib/kubelet/pods/*
rm: cannot remove ‘/var/lib/kubelet/pods/084cf8bd-5cd4-11e9-ad28-005056bd5f06/volumes/kubernetes.io~secret/default-token-c2xc7’: Device or resource busy

Step3. Start kubelet

"docker ps" could list the containers.

The node came back to the "Ready" state.
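To summarize the workaround above as a single sequence (a sketch, not an official fix: it wipes local pod state under /var/lib/kubelet, and still-mounted secret/volume tmpfs dirs may report "Device or resource busy" as shown above):

systemctl stop kubelet
rm -rf /var/lib/kubelet/pods/*    # mounted volume dirs may refuse to be removed
systemctl start kubelet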

@gridworkz

gridworkz commented Apr 15, 2019

Thanks for this - I lost two nodes over the weekend - flapping. I'll try this next time it happens. I've also brought this up with the folks at OpenEBS just to notify them. They never heard of it. (not an OpenEBS issue but I know some guys over there who are much smarter than me).

@MohammedFadin

I'm having the exact same issue on IBM Cloud Kubernetes as well.

@cshivashankar

cshivashankar commented Jun 26, 2020

Is this issue resolved?
I am getting PLEG issues in my cluster and observed this open issue.
Is there any workaround for this?

@billysutomo

I also got this issue. In my case the cause was a memory leak: collectd leaked memory, consuming up to 20 GB and leaving no resources for my Kubernetes deployment. After checking, I found the nodes NotReady.

@TheKangaroo

I see this issue on 1.17.6, with lots of free RAM on the host

free -g
              total        used        free      shared  buff/cache   available
Mem:             15           3           3           0           8          11
Swap:             0           0           0

@wwyhy

wwyhy commented Oct 23, 2020

I got the same issue on CentOS 7.6 with k8s 1.13.12 and Docker 18.06.2-ce. It causes the node to flap between Ready/NotReady intermittently.
Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m2.007756043s ago; threshold is 3m0s}

@gautamsoni17990

I am also facing the same issue on the latest kubelet version (1.20.5) and still getting this error. There is no memory issue, because I haven't deployed any applications yet.

@rdxmb

rdxmb commented Apr 27, 2021

Can this be reopened, or is there a new issue for this? The problem actually shows up on several setups, like AWS, Rancher, etc.
rancher/rancher#31793

EDIT: #101056

@nswarnkar

"PLEG is not healthy: pleg was last seen active 17m16.107709513s ago; threshold is 3m0s"

I am still facing this issue, and it fluctuates. Because of this, some pods get stuck. I am also using the Cluster Autoscaler, which starts adding nodes once pods can't be scheduled because of the node in error.

Any help or clue?

@nswarnkar

I will check, but I am in complete agreement with your assessment. When the PLEG issue comes up, I see that the pods, though working perfectly, have the status "Unknown". The moment I remove the stuck pod, their status becomes "green", i.e. active.

@nswarnkar

nswarnkar commented May 5, 2021

I tried this command:
kubectl get no ${NODE_NAME} -o json | jq '.status.config'

It returns null.

I checked kubectl version

kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:56:40Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.8", GitCommit:"fd5d41537aee486160ad9b5356a9d82363273721", GitTreeState:"clean", BuildDate:"2021-02-17T12:33:08Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

@nswarnkar

I looked at the metrics-server logs and found lines like:
E0503 18:34:06.511772 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:ip-10-0-32-228.ap-northeast-1.compute.internal: unable to fetch metrics from Kubelet ip-10-0-32-228.ap-northeast-1.compute.internal (10.0.32.228): Get https://10.0.32.228:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded
E0503 18:35:06.507445 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:ip-10-0-32-228.ap-northeast-1.compute.internal: unable to fetch metrics from Kubelet ip-10-0-32-228.ap-northeast-1.compute.internal (10.0.32.228): Get https://10.0.32.228:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded
E0503 18:36:06.507504 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:ip-10-0-32-228.ap-northeast-1.compute.internal: unable to fetch metrics from Kubelet ip-10-0-32-228.ap-northeast-1.compute.internal (10.0.32.228): Get https://10.0.32.228:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded
E0503 18:37:06.507423 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:ip-10-0-32-228.ap-northeast-1.compute.internal: unable to fetch metrics from Kubelet ip-10-0-32-228.ap-northeast-1.compute.internal (10.0.32.228): Get https://10.0.32.228:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded
E0503 18:38:06.507423 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:ip-10-0-32-228.ap-northeast-1.compute.internal: unable to fetch metrics from Kubelet ip-10-0-32-228.ap-northeast-1.compute.internal (10.0.32.228): Get https://10.0.32.228:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded
E0503 18:39:06.507403 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:ip-10-0-32-228.ap-northeast-1.compute.internal: unable to fetch metrics from Kubelet ip-10-0-32-228.ap-northeast-1.compute.internal (10.0.32.228): Get https://10.0.32.228:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded
E0503 18:39:48.455646 1 reststorage.go:135] unable to fetch node metrics for node "ip-10-0-32-228.ap-northeast-1.compute.internal": no metrics known for node
E0503 18:40:06.507445 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:ip-10-0-32-228.ap-northeast-1.compute.internal: unable to fetch metrics from Kubelet ip-10-0-32-228.ap-northeast-1.compute.internal (10.0.32.228): Get https://10.0.32.228:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded
E0503 19:22:06.507434 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:ip-10-0-32-228.ap-northeast-1.compute.internal: unable to fetch metrics from Kubelet ip-10-0-32-228.ap-northeast-1.compute.internal (10.0.32.228): Get https://10.0.32.228:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded

@Adam-Jin

Adam-Jin commented May 6, 2021

I had a similar problem with AKS, where the node state switched between Ready and NotReady. PLEG is not healthy, but CPU and memory consumption are not high. I contacted the AKS product team and they couldn't find the root cause either. Our developers frequently create new namespaces on the cluster, deploy a series of services into them, and delete the namespaces at the end; the problem can happen during these deployments.

@nswarnkar

> @gridworkz (via email): Try adding -n to the command. Namespace is probably kube-system

Same output, i.e. null.

@nswarnkar

> curl https://10.0.32.228:10250/metrics

I can call curl -k https://10.0.32.228:10250/metrics, but it says unauthorized.
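If hitting the kubelet port directly is unauthorized, going through the API server proxy with your kubeconfig credentials should work (assuming your user is allowed nodes/proxy; <node-name> is a placeholder):

kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" | grep -i pleg
# the summary endpoint that metrics-server scrapes
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" | head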

@nswarnkar

> @gridworkz (via email): Interesting - what network plugin are you using? (Canal / Calico or other)

Namespace: kube-system

@nswarnkar

I found pods running in kube-system:

calico-kube-controllers
and
canal

I'm not sure which CNI plugin this is; I could see both pods installed.

@nswarnkar

nswarnkar commented May 6, 2021

As per the logs, I checked that canal is being used on all worker and master nodes.
Interestingly, I found that calico-kube-controllers is deployed on the master node only. Is it managing the CNI plugin on all nodes?

@nswarnkar

nswarnkar commented May 7, 2021

kubectl describe apiservice v1beta1.metrics.k8s.io
Name: v1beta1.metrics.k8s.io
Namespace:
Labels:
Annotations: API Version: apiregistration.k8s.io/v1
Kind: APIService
Metadata:
Creation Timestamp: 2021-03-22T11:20:24Z
Resource Version: 1491
Self Link: /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.metrics.k8s.io
UID: f7a1f1a9-b31f-441b-a920-01f57c5a1fdc
Spec:
Group: metrics.k8s.io
Group Priority Minimum: 100
Insecure Skip TLS Verify: true
Service:
Name: metrics-server
Namespace: kube-system
Port: 443
Version: v1beta1
Version Priority: 100
Status:
Conditions:
Last Transition Time: 2021-03-22T11:22:21Z
Message: all checks passed
Reason: Passed
Status: True
Type: Available
Events:

The status of my metrics server shows all checks passed.

@nswarnkar

> @gridworkz (via email): OK thanks, looks like a permission thing on /metrics. Here's something to look at: kubernetes-sigs/metrics-server#167 There are some switches that can be sent to the kubernetes metrics server.

The same configuration (same switches) is being used by my metrics server.

@nswarnkar


https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

@gridworkz

Actually, the metrics server config looks OK. Your setup looks fine. I also have setups like this with no problems. None on Amazon. I think we may need to get them involved in this issue. In your case, it appears it might be AWS specific. You may need a higher access level to perform the things we need to do for diagnostics.

As I mentioned earlier, creating PODs incurs overhead (irrespective of resources RAM etc.). Mostly events and latency overhead. It's basically this overhead that's causing your PLEG issues when creating many pods on the cluster. In order to investigate this we need to fine tune things at a lower level, which I believe you don't have the access level to do. This is common with cloud providers as they don't really let you get under the hood. We have this issue a lot with AWS RDS.

If your budget is constrained I would dig into this deeper with AWS (I think we narrowed down the things they can check in this thread); otherwise I would just increase the server size of your setup.

@nswarnkar

nswarnkar commented May 7, 2021

Yes, actually to keep our budget low we chose such machines. Now I can access the machines with a higher access level. What set of debugging information can we collect?
If it is an AWS infrastructure issue, we should have enough information to raise a ticket with them. If you can guide me, I can collect it. Thanks so much for following this thread and replying so promptly.

@huangjiasingle

huangjiasingle commented Sep 8, 2021

The real reason for "PLEG not healthy" here is not runc: PLEG lists containers from dockerd, and dockerd does not go through runc for that. PLEG is marked unhealthy because the last successful container list was more than 3 minutes ago. PLEG lists containers from the Docker daemon with all=1&filters={"label":{"io.kubernetes.docker.type=container":true},"status":{"running":true}}. The Docker daemon first queries the containers from containerd, which reads them from its in-memory go-memdb and returns them to the daemon; the daemon then fetches each container's image info from the imagedb directory (usually /var/lib/docker/image/overlay2/imagedb, or /var/lib/docker/image/devicemapper/imagedb when using devicemapper). I think it is nearly impossible for the list to take more than 3 minutes inside the memdb lookup; it is much more likely to happen while fetching the containers' image info, for example when the node has many containers or many processes doing heavy I/O.
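A rough way to reproduce and time that list call from the node itself (a sketch: it assumes the default Docker socket at /var/run/docker.sock and a curl build with --unix-socket support, and uses the documented list form of the filter rather than the internal map form above):

time curl -sS -G --unix-socket /var/run/docker.sock \
  "http://localhost/containers/json" \
  --data-urlencode 'all=1' \
  --data-urlencode 'filters={"label":["io.kubernetes.docker.type=container"],"status":["running"]}' \
  > /dev/null
# if this takes anywhere near the 3m threshold, the bottleneck is on the docker/containerd side, not in the kubelet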

@huangjiasingle

The dockerd list API call chain:
github.com/docker/docker/api/server/router/container/container.go initRoutes() > getContainersJSON()
github.com/docker/docker/api/server/router/container/container_routes.go getContainersJSON() > Containers()
github.com/docker/docker/daemon/list.go Containers() > reduceContainers() > reducePsContainer() > refreshImage()

@abhiatmsit

Is this issue fixed in any Kubernetes version? Restarting the kubelet doesn't help.

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:45:37Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:39:34Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
root@ubuntu-xenial:~#

I can still see this issue:

Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message


NetworkUnavailable False Sun, 12 Sep 2021 16:04:37 +0000 Sun, 12 Sep 2021 16:04:37 +0000 CalicoIsUp Calico is running on this node
MemoryPressure False Sun, 19 Sep 2021 09:43:17 +0000 Fri, 13 Aug 2021 06:09:37 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 19 Sep 2021 09:43:17 +0000 Fri, 13 Aug 2021 06:09:37 +0000 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 19 Sep 2021 09:43:17 +0000 Fri, 13 Aug 2021 06:09:37 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Sun, 19 Sep 2021 09:43:17 +0000 Sun, 19 Sep 2021 06:11:19 +0000 KubeletNotReady PLEG is not healthy: pleg has yet to be successful

root@ubuntu-xenial:~# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Sun 2021-09-19 09:49:27 UTC; 2min 57s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 21878 (kubelet)
Tasks: 15
Memory: 77.1M
CPU: 19.599s
CGroup: /system.slice/kubelet.service
└─21878 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.5

lines 1-23/23 (END)
root@ubuntu-xenial:# free -mh
total used free shared buff/cache available
Mem: 4.8G 2.1G 266M 76M 2.4G 2.1G
Swap: 0B 0B 0B
root@ubuntu-xenial:~#

Sep 19 09:53:47 ubuntu-xenial systemd[1]: Removed slice libcontainer_31792_systemd_test_default.slice.
Sep 19 09:53:47 ubuntu-xenial kubelet[21878]: W0919 09:53:47.230337 21878 watcher.go:95] Error while processing event ("/sys/fs/cgroup/pids/libcontainer_31792_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): readdirent /sys/fs/cgroup/pids/libcontainer_31792_systemd_test_default.slice: no such file or directory
Sep 19 09:53:47 ubuntu-xenial kubelet[21878]: W0919 09:53:47.230564 21878 watcher.go:95] Error while processing event ("/sys/fs/cgroup/cpu,cpuacct/libcontainer_31792_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/cpu,cpuacct/libcontainer_31792_systemd_test_default.slice: no such file or directory
Sep 19 09:53:47 ubuntu-xenial kubelet[21878]: W0919 09:53:47.230598 21878 watcher.go:95] Error while processing event ("/sys/fs/cgroup/blkio/libcontainer_31792_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/blkio/libcontainer_31792_systemd_test_default.slice: no such file or directory
Sep 19 09:53:47 ubuntu-xenial kubelet[21878]: W0919 09:53:47.230613 21878 watcher.go:95] Error while processing event ("/sys/fs/cgroup/memory/libcontainer_31792_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/memory/libcontainer_31792_systemd_test_default.slice: no such file or directory
Sep 19 09:53:47 ubuntu-xenial kubelet[21878]: W0919 09:53:47.230626 21878 watcher.go:95] Error while processing event ("/sys/fs/cgroup/devices/libcontainer_31792_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/libcontainer_31792_systemd_test_default.slice: no such file or directory
Sep 19 09:53:47 ubuntu-xenial kubelet[21878]: W0919 09:53:47.230636 21878 watcher.go:95] Error while processing event ("/sys/fs/cgroup/pids/libcontainer_31792_systemd_test_default.slice": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/pids/libcontainer_31792_systemd_test_default.slice: no such file or directory
^C
root@ubuntu-xenial:~#

@prbakhsh

prbakhsh commented Dec 6, 2021

I have the same issue in my bare metal k8s.
k8s version: v1.19.3
cni: weave net

@rchunping

I had the same issue.
The worker node runs as: FreeBSD 10.4 -> bhyve -> CentOS 8
Hardware: Dell R420, E5-2470v2 * 2 + 64G

The problem was caused by a broken TSC:

kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.

Upgrading the VM host to FreeBSD 11.4 fixed the issue.
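For anyone who wants to check their node for the same thing (generic Linux checks, not bhyve-specific):

dmesg | grep -i tsc
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
cat /sys/devices/system/clocksource/clocksource0/available_clocksource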

@yileng146

I had the same issue. I just rebooted the problem node to resolve it:
shutdown -r now

@daemonadmin

Restarting docker.service resolved this issue for me.

@jpaulcristancho

Is this resolved? I have the same issue here.
