How to read metrics kubectl top nodes/pods? #193


Closed
Jeeppler opened this issue Dec 26, 2018 · 13 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@Jeeppler

How to read the metrics created by kubectl top nodes and kubectl top pods --all-namespaces?

Environment

I am running an AWS EKS cluster on a single EC2 t3.large instance (VM). The instance has 2 vCPUs and 8 GiB of memory; vCPUs are virtual CPUs. T3 instances can burst above the baseline for some time; that is, they use a credit-based scheduler and get more physical CPU time when needed.

kubectl top nodes

I ran the kubectl top nodes command three times in a row and received three different outputs:

$ kubectl top nodes                     
NAME                          CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%   
ip-10-10-1-247.ec2.internal   38m          1%        410Mi           5%        
$ kubectl top nodes
NAME                          CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%   
ip-10-10-1-247.ec2.internal   40m          2%        411Mi           5%        
$ kubectl top nodes
NAME                          CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%   
ip-10-10-1-247.ec2.internal   39m          1%        411Mi           5% 
  1. What do the CPU(cores) values 38m, 40m, 39m actually refer to? What is the unit m? The part I do not understand: I am apparently using 1-2% of CPU, but percent is a relative unit. What is 100% CPU?
  2. The same question applies to MEMORY%: what is 100% memory?

kubectl top pods --all-namespaces

Unlike kubectl top nodes, kubectl top pods --all-namespaces actually produces the same output each time. However, there is not much going on in my Kubernetes cluster. The following is a sample output:

$ kubectl top pods --all-namespaces
NAMESPACE     NAME                                                              CPU(cores)   MEMORY(bytes)   
kube-system   alb-ingress-controller-aws-alb-ingress-controller-67d7cf85lwdg2   3m           10Mi            
kube-system   aws-node-9nmnw                                                    2m           20Mi            
kube-system   coredns-7bcbfc4774-q4pjj                                          2m           7Mi             
kube-system   coredns-7bcbfc4774-wwlcr                                          2m           7Mi             
kube-system   external-dns-54df666786-2ld9w                                     1m           12Mi            
kube-system   kube-proxy-ss87v                                                  2m           10Mi            
kube-system   kubernetes-dashboard-5478c45897-fcm48                             1m           12Mi            
kube-system   metrics-server-5f64dbfb9d-fnk5r                                   1m           12Mi            
kube-system   tiller-deploy-85744d9bfb-64pcr                                    1m           29Mi  
  1. What does the CPU(cores) column represent here? What does the unit m stand for?
  2. What does MEMORY(bytes) actually mean? Is this how much memory the pod's containers use? However, kubectl top nodes shows a MEMORY(bytes) usage of approx. 410Mi, while adding up the MEMORY(bytes) of all the pods gives only 10+20+7+7+12+10+12+12+29 = 119Mi. Where are the other 410-119 ≈ 291Mi used?
@DirectXMan12
Contributor

ok, so on the subject of milliunits: https://github.com/DirectXMan12/k8s-prometheus-adapter/blob/master/docs/walkthrough.md#quantity-values or https://github.com/kubernetes-incubator/custom-metrics-apiserver/blob/master/docs/getting-started.md#quantities explains.
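
For concreteness, here is a minimal sketch (not from the linked docs, and assuming the k8s.io/apimachinery module is available) of how such quantity strings can be parsed and read back as millicores and bytes:

// A minimal sketch (not from the linked docs): parsing Kubernetes quantity
// strings with the apimachinery resource package. "38m" is 38 millicores
// (0.038 cores); "410Mi" is 410 * 2^20 bytes.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	cpu := resource.MustParse("38m") // CPU quantity as shown by kubectl top nodes
	fmt.Printf("CPU: %d millicores (%.3f cores)\n",
		cpu.MilliValue(), float64(cpu.MilliValue())/1000.0)

	mem := resource.MustParse("410Mi") // memory quantity as shown by kubectl top nodes
	fmt.Printf("Memory: %d bytes (~%.0f MiB)\n",
		mem.Value(), float64(mem.Value())/float64(1<<20))
}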

on the subject of what the values actually mean:

  • cores are a measurement of CPU time over time (which sounds weird, but actually isn't). Normal CPU usage from the OS is measured as cumulative CPU time used. As your process runs actively, it accumulates CPU time used. When it's not running, or is waiting on something else, it doesn't use CPU time. We take the rate of change of that cumulative usage and call it "cores", because 1 second of CPU time used over 1 second of real time means that an entire core was dedicated to your process for that 1-second interval (a sketch of the arithmetic follows this list).

  • memory, for pods, is the sum of the memory usage of the containers in the pod. For nodes, I believe it's the memory usage as reported by the system-wide node cgroup, so it may include stuff that's not in a pod, IIRC. We just report the information given to us by cadvisor on the node.
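
To make the "rate of change" point concrete, here is a small self-contained sketch (made-up sample values, not metrics-server code) that turns two cumulative CPU readings into a usage figure in cores and millicores; the node-percentage line assumes kubectl top divides usage by the node's allocatable CPU, which is an assumption rather than something confirmed in this thread:

// A self-contained sketch of the rate calculation described above, using
// made-up sample values. It is not metrics-server code.
package main

import (
	"fmt"
	"time"
)

// cpuSample is a hypothetical cumulative-CPU-time reading, as an agent such
// as cadvisor might expose it for a container or node.
type cpuSample struct {
	at            time.Time
	cumulativeCPU time.Duration // total CPU time consumed so far
}

func main() {
	// Two hypothetical samples taken 15 seconds apart.
	s1 := cpuSample{at: time.Unix(1000, 0), cumulativeCPU: 90 * time.Second}
	s2 := cpuSample{at: time.Unix(1015, 0), cumulativeCPU: 90*time.Second + 600*time.Millisecond}

	// "cores" = rate of change of cumulative CPU time:
	// 0.6s of CPU time over 15s of wall time = 0.04 cores = 40m.
	window := s2.at.Sub(s1.at).Seconds()
	cores := (s2.cumulativeCPU - s1.cumulativeCPU).Seconds() / window
	fmt.Printf("usage: %.3f cores (%.0fm)\n", cores, cores*1000)

	// As a percentage of the 2-vCPU node above, assuming the denominator is
	// the node's allocatable CPU (an assumption, not confirmed in this thread):
	nodeCores := 2.0
	fmt.Printf("node CPU%%: %.1f%%\n", cores/nodeCores*100)
}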

@Jeeppler
Author

Jeeppler commented Jan 4, 2019

@DirectXMan12 thanks for the explanation, that makes more sense now. However, I would be happy if the documentation for metrics server could contain a description of the units.

@DirectXMan12
Contributor

I'd be happy to accept a PR to update the docs :-).

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 28, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 28, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@blaise-sumo

I'd be happy to accept a PR to update the docs :-).

@DirectXMan12 , I would like to update the docs, I looked in this repo and also in https://github.com/DirectXMan12/k8s-prometheus-adapter/tree/master/docs but I'm not sure if that is the correct place.

@blaisep

blaisep commented Nov 27, 2019

Since this thread appears in search results, the docs seem to be updated at:
https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units

@wu105

wu105 commented Feb 23, 2020

I can understand the concept of units, requests, limits and actuals, but have a hard time reconciling the 'kubectl top' output with that of "traditional Linux commands", e.g., uptime (load averages over the past 1, 5, and 15 minutes), free (memory used, free, shared, buff/cache, available), top ('uptime' + 'free -k', plus VIRT/RES/SHR memory and %CPU/%MEM per process).

One question about the kubectl top commands is: for what time period are they reporting? When are samples taken, and which samples are rolled up into the output? Linux commands, on the other hand, always report the state at the time the command is executed, and they include non-Kubernetes processes.

This understanding has an impact on managing resource utilization and the stability of Kubernetes. We have random node crashes and suspect a resource crunch, but the utilization numbers are usually low. Kubernetes requires us to turn off swapping, which can be a contributing factor: no swapping means no wiggle room when memory overruns, and inactive contents occupy memory that goes to waste.

@serathius
Contributor

serathius commented Feb 24, 2020

Hey @wu105,
kubectl top was not designed to be a standalone tool for debugging resource usage or a replacement for traditional Linux commands.
It's an additional tool for the core metrics pipeline.

The main purpose of the metrics pipeline is delivering metrics for autoscaling. The values exposed are curated for autoscaling and OOM detection. Building an autoscaling pipeline that can scale to thousands of nodes has different requirements than collecting the accurate resource usage you are talking about.

Please take a look at the new FAQ section for metrics-server: https://github.com/kubernetes-sigs/metrics-server
I would be happy to answer any additional questions, but please create a separate issue for them. Writing on old issues makes it harder for other people to discover the thread.

@serathius
Contributor

I would encourage you to look into monitoring systems like Prometheus, which are closer to what you're describing.

@wu105

wu105 commented Feb 24, 2020

Submitted #447
