
Add cadvisor machine metrics #95210

Closed
wants to merge 1 commit into from

Conversation

erikwilson
Contributor

What type of PR is this?
/kind bug

What this PR does / why we need it:
Adds Prometheus cadvisor machine metrics back into mainline k8s (should be backported to 1.19).

Which issue(s) this PR fixes:

Fixes #95204

Special notes for your reviewer:
Assumes we want CPUTopologyMetrics

Does this PR introduce a user-facing change?:

Provide cadvisor machine metrics

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note, kind/bug, size/XS, cncf-cla: yes, and needs-sig labels on Sep 30, 2020
@k8s-ci-robot
Contributor

Hi @erikwilson. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-priority, needs-ok-to-test, area/kubelet, and sig/node labels and removed the needs-sig label on Sep 30, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: erikwilson
To complete the pull request process, please assign cheftako after the PR has been reviewed.
You can assign the PR to them by writing /assign @cheftako in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shubheksha
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Oct 2, 2020
Contributor

@shubheksha left a comment

LGTM, thanks for adding this!

@erikwilson
Contributor Author

/test pull-kubernetes-node-e2e

@basvdlei

I've tried a build of v1.19.2 with this fix, but the machine_ metrics become stale and disappear from Prometheus, since they are now timestamped (google/cadvisor@738f136). The kubelet only collects these metrics once and assumes a reboot for any updates, while cadvisor's manager starts a goroutine that updates them every 5 min (configurable).

I'm not sure why the machine info metrics needed to be timestamped, but this seems to break how the kubelet has cached them until now, unless the Prometheus config disables honor_timestamps for this specific scrape job.
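
A minimal, self-contained sketch (not cadvisor's or the kubelet's actual code) of how a client_golang collector attaches an explicit timestamp to a metric; once a sample carries a fixed timestamp that is never refreshed, Prometheus stops returning the series after its lookback window (roughly five minutes by default), which matches the disappearing machine_* metrics described above:

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// staleMachineCollector is a hypothetical stand-in for the cadvisor machine
// collector: it emits one gauge stamped with a timestamp captured once at
// startup and never updated, like the kubelet's cached MachineInfo.Timestamp.
type staleMachineCollector struct {
	desc      *prometheus.Desc
	collected time.Time
}

func (c *staleMachineCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *staleMachineCollector) Collect(ch chan<- prometheus.Metric) {
	m := prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 8)
	// The explicit timestamp is what eventually makes Prometheus treat the
	// sample as too old to return from instant queries.
	ch <- prometheus.NewMetricWithTimestamp(c.collected, m)
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&staleMachineCollector{
		desc:      prometheus.NewDesc("machine_cpu_cores_sketch", "Hypothetical machine metric.", nil, nil),
		collected: time.Now(), // captured once, never refreshed
	})
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9999", nil))
}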

@erikwilson
Contributor Author

Thanks for pointing that out @basvdlei. It looks like that is run as part of the manager; I'm still trying to grok why this code is different or how it relates to other cadvisor code in the kubelet, like here:

return cc.Manager.Start()
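
For context, a rough sketch of the kind of periodic refresh loop the cadvisor manager runs (names and interval handling here are illustrative, not the actual cadvisor or kubelet code); the kubelet, by contrast, fills its cache once at startup:

package main

import (
	"fmt"
	"sync"
	"time"
)

// machineInfo and infoCache are illustrative stand-ins, not kubelet types.
type machineInfo struct {
	NumCores  int
	Timestamp time.Time
}

type infoCache struct {
	mu   sync.RWMutex
	info machineInfo
}

func (c *infoCache) set(mi machineInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.info = mi
}

func (c *infoCache) get() machineInfo {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.info
}

// refreshLoop re-reads machine info on an interval, the way cadvisor's
// manager keeps its copy (and its timestamp) current.
func refreshLoop(c *infoCache, interval time.Duration, read func() machineInfo, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.set(read())
		case <-stop:
			return
		}
	}
}

func main() {
	read := func() machineInfo { return machineInfo{NumCores: 8, Timestamp: time.Now()} }
	c := &infoCache{}
	c.set(read()) // the kubelet stops here; cadvisor's manager keeps refreshing
	stop := make(chan struct{})
	go refreshLoop(c, 5*time.Minute, read, stop)
	fmt.Println("cached at:", c.get().Timestamp)
	close(stop)
}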

@luisusr

luisusr commented Oct 15, 2020

I checked and this has not been resolved in the 1.19.3 release, since the PR is not ready, so I had to stay on kubelet 1.18.9 with kubeadm and kubectl 1.19.3. Is there a better workaround for this, or should cadvisor be run as a separate DaemonSet in this case? I don't know if that last option is a good practice.

@SergeyKanzhelev
Member

/assign @bobbypage

@lingsamuel
Contributor

A tricky fix is to change the cached machine info timestamp when getting it:

// GetCachedMachineInfo assumes that the machine info can't change without a reboot
func (kl *Kubelet) GetCachedMachineInfo() (*cadvisorapiv1.MachineInfo, error) {
	kl.machineInfoLock.RLock()
	defer kl.machineInfoLock.RUnlock()
	return kl.machineInfo, nil
}

To:

// GetCachedMachineInfo assumes that the machine info can't change without a reboot
func (kl *Kubelet) GetCachedMachineInfo() (*cadvisorapiv1.MachineInfo, error) {
	kl.machineInfoLock.RLock()
	defer kl.machineInfoLock.RUnlock()
	clone := kl.machineInfo.Clone()
	clone.Timestamp = time.Now() // refresh the timestamp so the exported machine metrics are not marked stale
	return clone, nil
}

@@ -357,6 +357,7 @@ func (s *Server) InstallDefaultHandlers(enableCAdvisorJSONEndpoints bool) {
 		cadvisormetrics.NetworkUsageMetrics: struct{}{},
 		cadvisormetrics.AppMetrics: struct{}{},
 		cadvisormetrics.ProcessMetrics: struct{}{},
+		cadvisormetrics.CPUTopologyMetrics: struct{}{},
Contributor

This is adding metrics that weren't originally present, right? Since we should cherry-pick this change, we shouldn't introduce new metrics here.
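
For readers unfamiliar with the set being modified above, a small self-contained sketch of how an includedMetrics set is typically consumed; the MetricSet and MetricKind types below are local stand-ins that mirror cadvisor's container.MetricSet, not the real package:

package main

import "fmt"

// MetricKind and MetricSet mirror cadvisor's container.MetricKind and
// container.MetricSet for illustration only.
type MetricKind string

const (
	ProcessMetrics     MetricKind = "process"
	CPUTopologyMetrics MetricKind = "cpu_topology"
)

type MetricSet map[MetricKind]struct{}

func (s MetricSet) Has(k MetricKind) bool {
	_, ok := s[k]
	return ok
}

func main() {
	// Adding CPUTopologyMetrics to the set is what turns the extra metric
	// group on, which is why the reviewer flags it for a cherry-pick.
	included := MetricSet{
		ProcessMetrics:     {},
		CPUTopologyMetrics: {},
	}
	fmt.Println("expose CPU topology metrics:", included.Has(CPUTopologyMetrics))
}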

@lingsamuel
Contributor

@basvdlei Do you mind trying this PR with the fix in this comment? I haven't tested it yet, but I believe it should solve the problem you described.

@basvdlei

@lingsamuel sorry for the late response. I made another build with your patch and as far as I can tell that works. At least for the metrics I use. Updating the timestamp on scrape should prevent the metrics from becoming stale.

I was thinking about this though, and it does feel a bit like a workaround. Since the timestamps have no meaning on the cached metrics, I'm wondering if there is some way we can drop them altogether.

@lingsamuel
Contributor

Yes, it definitely is a temporary workaround, and I am not sure it can be merged into the k8s codebase.
I think the timestamp is a cadvisor design decision; maybe we should open an issue in its repo.

@basvdlei

I took a little bit more time to look into this. The collector will only include a timestamp when it's non-zero. So it's just a small fix to get the old behavior back, by setting the timestamp to zero before caching the machineInfo:

diff --git a/pkg/kubelet/kubelet.go b/pkg/kubelet/kubelet.go
index 2fdd321e583..7ccb28daad2 100644
--- a/pkg/kubelet/kubelet.go
+++ b/pkg/kubelet/kubelet.go
@@ -561,6 +561,7 @@ func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
        if err != nil {
                return nil, err
        }
+       machineInfo.Timestamp = time.Time{}
        klet.setCachedMachineInfo(machineInfo)
 
        imageBackOff := flowcontrol.NewBackOff(backOffPeriod, MaxContainerBackOff)
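
A minimal sketch of the collector-side behavior described here (assumed, not copied from cadvisor's source): the metric is only wrapped with an explicit timestamp when the cached timestamp is non-zero, which is why zeroing MachineInfo.Timestamp restores the old, untimestamped exposition:

package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// emitMachineMetric mirrors the described behavior: attach an explicit
// timestamp only when the cached timestamp is set.
func emitMachineMetric(ch chan<- prometheus.Metric, desc *prometheus.Desc, value float64, ts time.Time) {
	m := prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, value)
	if !ts.IsZero() {
		// Non-zero timestamp: the exposition line carries an explicit timestamp.
		m = prometheus.NewMetricWithTimestamp(ts, m)
	}
	ch <- m
}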

@rashmichandrashekar

rashmichandrashekar commented Dec 1, 2020

Thanks for making this change. This issue has caused the node metric collection for Azure Monitor for Containers to be broken which also impacts alerting. When can we expect the fix to be backported and available?

@luisusr

luisusr commented Dec 16, 2020

I agree. We are on 1.20 and I can't upgrade because of this. The worst thing is that the metrics suddenly disappeared in 1.19 with no warning and no mention in the release notes, and the impact on third-party utilities and platforms is huge. So should we use other alternatives, or will this be resolved? Regards

@kasbst

kasbst commented Dec 31, 2020

Is it possible to handle this with some urgency now? As already mentioned, Azure Monitor 'insights.container/nodes' metrics are broken for AKS 1.19.X+ because of this. Alerting is broken as well, and it blocks us from upgrading our eastus2 production AKS cluster. Thanks!

@frittentheke

@kasbst check out #95204 (comment)

@SergeyKanzhelev
Member

since #97006 is merged and issue was closed, let's close this PR. Thank you!

/close

Please re-open if you want the part exposing CPUTopologyMetrics.

@k8s-ci-robot
Contributor

@SergeyKanzhelev: Closed this PR.

In response to this:

since #97006 is merged and issue was closed, let's close this PR. Thank you!

/close

Please re-open if you want the part exposing CPUTopologyMetrics.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
