PLEG is not healthy #61117
Comments
/sig node |
@albertvaka Was this bug fixed, or why was the issue closed? I am seeing the same thing. |
Also just had this wonderful experience (with Calico 2.6) on one of my nodes in Azure. @albertvaka probably abandoned it in favour of #45419. |
@albertvaka Out of memory on a node can cause this, and when it happens the node can't update its own node status to say so. You might find the hyperkube process has leaked to multiple GB of RAM. This might not be your problem, but it's one idea to check. |
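A quick hedged sketch of that check, assuming shell access to the node (the process name varies by setup - hyperkube, kubelet, or collectd in a later report):

```bash
# Sort processes by memory usage and show the top consumers; a leaked
# hyperkube/kubelet process would show up here with an outsized RSS.
ps aux --sort=-%mem | head -n 10
```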
In my case (Kubernetes 1.11.1) the problem is not caused by an insufficient amount of free memory. |
I am getting the same error message on a CentOS 7 node; it has a lot of free memory. |
@grailsweb Did you find a solution? Having the same issue in AKS v1.11. |
Kubelet 1.12.3. I have met this issue; it is not a memory problem. I will try to file this on an open issue somewhere. |
Has this been fixed? We have the same issue with v1.12.0. Node memory isn't a problem. |
I have the same issue on K8s v1.10.11 with healthy CPU, memory, and disk. Here is how I fixed it temporarily. After restarting the Docker daemon there were no containers running, while the kubelet logs kept printing the PLEG error. That gave me the idea to delete "/var/lib/kubelet/pods/*". Step 1: stop kubelet. Step 2: delete /var/lib/kubelet/pods/*. Step 3: start kubelet. After that, "docker ps" could list the containers and the node came back to the "Ready" state. |
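For readers trying this, a hedged sketch of that recovery sequence, assuming kubelet and Docker run as systemd services (unit names may differ on your distro):

```bash
# Step 1: stop kubelet so it releases its pod state
sudo systemctl stop kubelet

# Step 2: clear kubelet's local pod state. Destructive: pods on this node
# will be recreated by their controllers, so only do this on a node you
# are prepared to recycle.
sudo rm -rf /var/lib/kubelet/pods/*

# Optionally restart the Docker daemon if it is wedged
sudo systemctl restart docker

# Step 3: start kubelet again
sudo systemctl start kubelet

# Verify: containers should be listed and the node should return to Ready
docker ps
kubectl get nodes
```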
Thanks for this - I lost two nodes over the weekend - flapping. I'll try this next time it happens. I've also brought this up with the folks at OpenEBS just to notify them. They had never heard of it. (Not an OpenEBS issue, but I know some guys over there who are much smarter than me.) |
I'm having the exact same issue on IBM Cloud Kubernetes as well. |
It seems to happen to me when I have too many application instances running and too few nodes. It doesn't matter what the size of the nodes is. I have a simple 3-node test cluster going. I create one project/namespace and run one instance of Odoo - all good. I add a few more instances of Odoo and after a week or so I'm plagued with PLEG errors. My nodes are beefy too. This has happened on UpCloud, Hetzner and DigitalOcean.
|
Is this issue resolved? |
I also got this issue. In my case the cause was a memory leak. I don't know how, but collectd leaked memory and ate my memory resources, up to 20 GB, leaving no resources for my Kubernetes deployment. After checking, I found the nodes NotReady. |
I see this issue on 1.17.6, with lots of free RAM on the host. |
I got the same issue on CentOS 7.6 with K8s 1.13.12 and Docker 18.06.2-ce. It causes the node to flap between Ready/NotReady intermittently. |
I am also facing the same issue on the latest kubelet version, which is 1.20.5, and still getting this error. There is no memory issue, because I have not deployed any application yet. |
btw: Where are you hosting your servers?
|
Can this be reopened, or is there a new issue for that? The problem can actually be found on several systems (AWS, Rancher, etc.). EDIT: #101056 |
"PLEG is not healthy: pleg was last seen active 17m16.107709513s ago; threshold is 3m0s" I am still facing this issue. This is fluctuating. Because of this some pods get stuck. I am also using Cluster Autoscaler, which starts adding node, once pods are not scheduled because of this node in error. Any help or clue? |
Hi:
I generally find this issue arises from network / service mesh inadequacy. The network is usually the last place people look - we just assume it's super fast. I've tried a myriad of configurations across many providers, local clusters, etc., and this issue is common across all setups. The only common denominator in all these setups is the underlying network. Now, when I set up a K8s provider service like DO or Google Cloud I never run into this issue (or it's extremely rare). They obviously have the network juice. So either DO / Google have a different K8s code base than the one in the wild, or they have infrastructure from hell to mitigate these PLEG-type errors.
So where does that leave the average bear? Well, that's a tough one. I've had this issue occur even with killer ping times to my nodes. Something occasionally gets lost in transit when the cluster is up and running. To be honest, I still think this is a resource issue. For the longest time I thought it was the Docker container version, but I've had the cluster go south across many different versions of Docker as well.
I know this is not a solution, just many years of observation. My two cents is to look into good service mesh tech and really focus on the underlying network.
Cheers,
Dave
|
I was afraid of that. It looks like we'll need to tweak things via a ConfigMap. Interesting, though, that a bigger box worked in the past. This is straight-up a resource/timing issue. The trick is to find which resource is falling over. Since all the usual suspects look fine, it's likely the Docker/kubelet interaction (it pumps out these types of errors). Your IP pool is large enough, so it couldn't be that. I recommend trying to get a dump of the current kubelet config as outlined in the second link I sent. Hopefully you have enough permission for that.
…On Wed, May 5, 2021 at 3:34 PM Neeraj Swarnkar wrote:
> https://www.ibm.com/docs/en/cloud-paks/cp-management/1.1.0?topic=mmcpt-slow-interaction-between-kubelet-docker-cause-pleg-issues
> I could not try this, as I think it needs root access on the node, which I don't have now.
|
Just a note: two chart items you should be looking at in your Netdata:
PLEG Relisting Interval Summary (microseconds)
PLEG Relisting Latency Summary (microseconds)
To check your kubelet metrics you can go to: https://127.0.0.1:10250/metrics
The pod relist needs to complete in 3 minutes or it barfs. There may be too many pods on the node for the relist to complete in 3 minutes. The number of events and the latency are proportional to the number of pods, independent of node resources.
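A hedged sketch of pulling those PLEG series directly from the kubelet, assuming you can reach the secure port from the node and have a token authorized for the kubelet API (the token value is a placeholder; metric names changed across releases - older kubelets expose kubelet_pleg_relist_latency_microseconds, newer ones kubelet_pleg_relist_duration_seconds, so the grep is deliberately loose):

```bash
# Pull the kubelet's metrics and keep only the PLEG relist series.
# -k skips TLS verification (the kubelet often uses a self-signed cert).
TOKEN="<token authorized for the kubelet API>"   # placeholder
curl -sk -H "Authorization: Bearer ${TOKEN}" https://127.0.0.1:10250/metrics \
  | grep -i pleg
```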
|
I will check, but I am in complete agreement with your assessment. When the PLEG issue comes, I see that the pods, though working perfectly, have the status "Unknown". The moment I remove the stuck pod, their status becomes "green", i.e. active. |
I tried this command: kubectl get no ${NODE_NAME} -o json | jq '.status.config' - it returns null. I also checked kubectl version. |
Try adding -n <namespace name> to the command. The namespace is probably kube-system.
|
I saw the logs of the metrics server and found lines like this one, repeating every minute:
E0503 18:34:06.511772 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:ip-10-0-32-228.ap-northeast-1.compute.internal: unable to fetch metrics from Kubelet ip-10-0-32-228.ap-northeast-1.compute.internal (10.0.32.228): Get https://10.0.32.228:10250/stats/summary?only_cpu_and_memory=true: context deadline exceeded
E0503 18:39:48.455646 1 reststorage.go:135] unable to fetch node metrics for node "ip-10-0-32-228.ap-northeast-1.compute.internal": no metrics known for node |
Interesting - what network plugin are you using? (Canal / Calico or other)
|
Can you curl https://10.0.32.228:10250/metrics? It could be DNS, or more likely the scrape_timeout needs to be increased. I know how to do that in Prometheus, not sure in Netdata.
|
I had a similar problem with AKS, where the node state switched between Ready and NotReady. PLEG is not healthy, but CPU and memory consumption are not high. I contacted the AKS product team and they couldn't find the root cause either. Our developers frequently create new namespaces on the cluster, deploy a series of services into them, and delete the namespace at the end; this can happen during deployment. |
Thanks, by NS are you referring to namespaces?
|
Same output, i.e. null. |
I can call curl -k https://10.0.32.228:10250/metrics, but it said "Unauthorized". |
I found pods running in kube-system: calico-kube-controllers. Which CNI plugin is this? Not sure - I could see pods from both installed. |
As per the logs, I checked that Canal is being used on all worker and master nodes. |
OK thanks, looks like a permission thing on /metrics. Here's something to look at: kubernetes-sigs/metrics-server#167. There are some switches that can be passed to the Kubernetes metrics server.
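For reference, a hedged sketch of two switches commonly suggested in that issue (--kubelet-insecure-tls and --kubelet-preferred-address-types are real metrics-server flags, but whether they help here depends on the setup, and the patch assumes an args array already exists on the first container):

```bash
# Append the switches to the metrics-server Deployment's container args.
kubectl -n kube-system patch deployment metrics-server --type=json -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"},
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-preferred-address-types=InternalIP"}
]'
```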
|
The status checks of my metrics server all pass. |
The same configuration is being used by my metrics server (same switches). |
https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml |
Actually, the metrics server config looks OK. Your setup looks fine. I also have setups like this with no problems, though none are on Amazon. I think we may need to get them involved in this issue. In your case, it appears it might be AWS-specific. You may need a higher access level to perform the things we need to do for diagnostics. As I mentioned earlier, creating pods incurs overhead (irrespective of resources, RAM, etc.) - mostly event and latency overhead. It's basically this overhead that's causing your PLEG issues when creating many pods on the cluster. In order to investigate this we need to fine-tune things at a lower level, which I believe you don't have the access level to do. This is common with cloud providers, as they don't really let you get under the hood. We have this issue a lot with AWS RDS. If your budget is constrained I would dig into this deeper with AWS (I think we narrowed down the things they can check in this thread); otherwise I would just increase the server size of your setup. |
Yes, actually to keep our budget low we chose such machines. Now I can access the machines with a higher access level. What set of debugging information can we collect? |
Great! Our original goal was to grab a dump of the current kubelet config so that we can possibly change some parameters to extend the relist timeout via a ConfigMap.
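A hedged sketch of grabbing that dump without touching the node directly, via the API server's node proxy (the configz endpoint is a debug feature and may be disabled in some clusters; NODE_NAME is a placeholder):

```bash
# Dump the running kubelet configuration for one node as JSON.
NODE_NAME="<your node name>"   # placeholder
kubectl get --raw "/api/v1/nodes/${NODE_NAME}/proxy/configz" | jq .
```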
|
The real reason PLEG is not healthy is not runc; it is listing containers from dockerd. PLEG is marked unhealthy because the time since the last successful container list exceeds the 3-minute threshold. PLEG lists containers from the Docker daemon by labels: all=1&filters={"label":{"io.kubernetes.docker.type=container":true},"status":{"running":true}}. The Docker daemon first queries the containers from containerd, and containerd reads go-memdb (an in-memory DB) and returns the containers to the daemon, which uses them to answer the list call.
The dockerd list API call chain: |
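As a hedged sketch, you can approximate the same list call from the node with the Docker CLI, using the label and status filters quoted above:

```bash
# Roughly the container list PLEG asks dockerd for: running containers
# created by Kubernetes. If this command itself hangs, the daemon (or
# containerd behind it) is the PLEG bottleneck.
time docker ps \
  --filter "label=io.kubernetes.docker.type=container" \
  --filter "status=running"
```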
Is this issue fixed in Kubernetes v1.22.1? Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:45:37Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}. I can still see this issue. Node conditions: NetworkUnavailable False Sun, 12 Sep 2021 16:04:37 +0000 Sun, 12 Sep 2021 16:04:37 +0000 CalicoIsUp Calico is running on this node. systemctl status kubelet shows: Sep 19 09:53:47 ubuntu-xenial systemd[1]: Removed slice libcontainer_31792_systemd_test_default.slice. |
I have the same issue in my bare-metal K8s. |
I had the same issue; the problem was caused by a broken TSC. Upgrading the VM host to FreeBSD 11.4 fixed the issue. |
I have the same issue |
Restarting docker.service resolved this issue for me. |
Is this resolved? I have the same issue here. |
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
I had a node misbehaving. The journal of the kubelet showed this:
Environment:
- Kubernetes version (use kubectl version): 1.8.8
- Kernel (e.g. uname -a): 4.13.0-1011-gcp