
Add cadvisor machine metrics #95210

Closed
wants to merge 1 commit into from

Conversation

erikwilson
Contributor

What type of PR is this?
/kind bug

What this PR does / why we need it:
Adds Prometheus cadvisor machine metrics back into mainline k8s (should be backported to 1.19).

Which issue(s) this PR fixes:

Fixes #95204

Special notes for your reviewer:
Assumes we want CPUTopologyMetrics

Does this PR introduce a user-facing change?:

Provide cadvisor machine metrics

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note, kind/bug, size/XS, cncf-cla: yes, and needs-sig labels on Sep 30, 2020
@k8s-ci-robot
Contributor

Hi @erikwilson. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-priority, needs-ok-to-test, area/kubelet, and sig/node labels and removed the needs-sig label on Sep 30, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: erikwilson
To complete the pull request process, please assign cheftako after the PR has been reviewed.
You can assign the PR to them by writing /assign @cheftako in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shubheksha
Contributor

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label on Oct 2, 2020
Contributor

@shubheksha left a comment

LGTM, thanks for adding this!

@erikwilson
Contributor Author

/test pull-kubernetes-node-e2e

@basvdlei

I've tried a build of v1.19.2 with this fix, but the machine_ metrics become stale and disappear from Prometheus, since they are now timestamped (google/cadvisor@738f136). The kubelet only collects these metrics once and assumes a reboot for any updates, while cadvisor's manager starts a goroutine that updates them every 5 min (configurable).

I'm not sure why the machine info metrics needed to be timestamped, but this seems to break how the kubelet has cached them until now, unless the Prometheus config disables honor_timestamps for this specific scrape job.
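
A minimal, self-contained sketch (not cadvisor's or the kubelet's actual code) of how a client_golang collector attaches an explicit timestamp to a metric; once a sample carries a fixed timestamp that is never refreshed, Prometheus stops returning the series after its lookback window (roughly five minutes by default), which matches the disappearing machine_* metrics described above:

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// staleMachineCollector is a hypothetical stand-in for the cadvisor machine
// collector: it emits one gauge stamped with a timestamp captured once at
// startup and never updated, like the kubelet's cached MachineInfo.Timestamp.
type staleMachineCollector struct {
	desc      *prometheus.Desc
	collected time.Time
}

func (c *staleMachineCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *staleMachineCollector) Collect(ch chan<- prometheus.Metric) {
	m := prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, 8)
	// The explicit timestamp is what eventually makes Prometheus treat the
	// sample as too old to return from instant queries.
	ch <- prometheus.NewMetricWithTimestamp(c.collected, m)
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(&staleMachineCollector{
		desc:      prometheus.NewDesc("machine_cpu_cores_sketch", "Hypothetical machine metric.", nil, nil),
		collected: time.Now(), // captured once, never refreshed
	})
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":9999", nil))
}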

@erikwilson
Contributor Author

Thanks for pointing that out @basvdlei. It looks like that is run as part of the manager; I'm still trying to grok why this code is different or how it relates to other cadvisor code in the kubelet, like here:

return cc.Manager.Start()
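
For context, a rough sketch of the kind of periodic refresh loop the cadvisor manager runs (names and interval handling here are illustrative, not the actual cadvisor or kubelet code); the kubelet, by contrast, fills its cache once at startup:

package main

import (
	"fmt"
	"sync"
	"time"
)

// machineInfo and infoCache are illustrative stand-ins, not kubelet types.
type machineInfo struct {
	NumCores  int
	Timestamp time.Time
}

type infoCache struct {
	mu   sync.RWMutex
	info machineInfo
}

func (c *infoCache) set(mi machineInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.info = mi
}

func (c *infoCache) get() machineInfo {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.info
}

// refreshLoop re-reads machine info on an interval, the way cadvisor's
// manager keeps its copy (and its timestamp) current.
func refreshLoop(c *infoCache, interval time.Duration, read func() machineInfo, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.set(read())
		case <-stop:
			return
		}
	}
}

func main() {
	read := func() machineInfo { return machineInfo{NumCores: 8, Timestamp: time.Now()} }
	c := &infoCache{}
	c.set(read()) // the kubelet stops here; cadvisor's manager keeps refreshing
	stop := make(chan struct{})
	go refreshLoop(c, 5*time.Minute, read, stop)
	fmt.Println("cached at:", c.get().Timestamp)
	close(stop)
}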

@luisusr

luisusr commented Oct 15, 2020

I checked and this has not been resolved in the 1.19.3 release, since the PR is not ready, so I had to stay on kubelet 1.18.9 with kubeadm and kubectl 1.19.3. Is there a better workaround for this, or should cadvisor be run as a separate DaemonSet in this case? I don't know if that last option is a good practice.

@SergeyKanzhelev
Member

/assign @bobbypage

@lingsamuel
Contributor

A tricky fix is to change the cached machine info timestamp when getting it:

// GetCachedMachineInfo assumes that the machine info can't change without a reboot
func (kl *Kubelet) GetCachedMachineInfo() (*cadvisorapiv1.MachineInfo, error) {
	kl.machineInfoLock.RLock()
	defer kl.machineInfoLock.RUnlock()
	return kl.machineInfo, nil
}

To:

// GetCachedMachineInfo assumes that the machine info can't change without a reboot
func (kl *Kubelet) GetCachedMachineInfo() (*cadvisorapiv1.MachineInfo, error) {
	kl.machineInfoLock.RLock()
	defer kl.machineInfoLock.RUnlock()
	clone := kl.machineInfo.Clone()
	clone.Timestamp = time.Now() // refresh the timestamp so the exported machine metrics are not marked stale
	return clone, nil
}

@@ -357,6 +357,7 @@ func (s *Server) InstallDefaultHandlers(enableCAdvisorJSONEndpoints bool) {
 		cadvisormetrics.NetworkUsageMetrics: struct{}{},
 		cadvisormetrics.AppMetrics: struct{}{},
 		cadvisormetrics.ProcessMetrics: struct{}{},
+		cadvisormetrics.CPUTopologyMetrics: struct{}{},
Contributor

This is adding metrics that weren't originally present, right? Since we should cherry-pick this change, we shouldn't introduce new metrics here.
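
For readers unfamiliar with the set being modified above, a small self-contained sketch of how an includedMetrics set is typically consumed; the MetricSet and MetricKind types below are local stand-ins that mirror cadvisor's container.MetricSet, not the real package:

package main

import "fmt"

// MetricKind and MetricSet mirror cadvisor's container.MetricKind and
// container.MetricSet for illustration only.
type MetricKind string

const (
	ProcessMetrics     MetricKind = "process"
	CPUTopologyMetrics MetricKind = "cpu_topology"
)

type MetricSet map[MetricKind]struct{}

func (s MetricSet) Has(k MetricKind) bool {
	_, ok := s[k]
	return ok
}

func main() {
	// Adding CPUTopologyMetrics to the set is what turns the extra metric
	// group on, which is why the reviewer flags it for a cherry-pick.
	included := MetricSet{
		ProcessMetrics:     {},
		CPUTopologyMetrics: {},
	}
	fmt.Println("expose CPU topology metrics:", included.Has(CPUTopologyMetrics))
}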

@lingsamuel
Contributor

@basvdlei Do you mind trying this PR with the fix in this comment? I haven't tested it yet, but I believe it should solve the problem you described.

@basvdlei

@lingsamuel sorry for the late response. I made another build with your patch and as far as I can tell that works. At least for the metrics I use. Updating the timestamp on scrape should prevent the metrics from becoming stale.

I was thinking about this though, and it does feel a bit like a workaround. Since the timestamps have no meaning on the cached metrics, I'm wondering if there is some way we can drop them altogether.

@lingsamuel
Contributor

Yes, it definitely is a temporary workaround, and I am not sure it can be merged into the k8s codebase.
I think the timestamp is a cadvisor design decision; maybe we should open an issue in its repo.

@basvdlei

I took a little bit more time to look into this. The collector will only include a timestamp when it's non-zero. So it's just a small fix to get the old behavior back, by setting the timestamp to zero before caching the machineInfo:

diff --git a/pkg/kubelet/kubelet.go b/pkg/kubelet/kubelet.go
index 2fdd321e583..7ccb28daad2 100644
--- a/pkg/kubelet/kubelet.go
+++ b/pkg/kubelet/kubelet.go
@@ -561,6 +561,7 @@ func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
        if err != nil {
                return nil, err
        }
+       machineInfo.Timestamp = time.Time{}
        klet.setCachedMachineInfo(machineInfo)
 
        imageBackOff := flowcontrol.NewBackOff(backOffPeriod, MaxContainerBackOff)
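
A minimal sketch of the collector-side behavior described here (assumed, not copied from cadvisor's source): the metric is only wrapped with an explicit timestamp when the cached timestamp is non-zero, which is why zeroing MachineInfo.Timestamp restores the old, untimestamped exposition:

package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// emitMachineMetric mirrors the described behavior: attach an explicit
// timestamp only when the cached timestamp is set.
func emitMachineMetric(ch chan<- prometheus.Metric, desc *prometheus.Desc, value float64, ts time.Time) {
	m := prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, value)
	if !ts.IsZero() {
		// Non-zero timestamp: the exposition line carries an explicit timestamp.
		m = prometheus.NewMetricWithTimestamp(ts, m)
	}
	ch <- m
}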

@rashmichandrashekar

rashmichandrashekar commented Dec 1, 2020

Thanks for making this change. This issue has caused the node metric collection for Azure Monitor for Containers to be broken which also impacts alerting. When can we expect the fix to be backported and available?

@luisusr

luisusr commented Dec 16, 2020

I agree. We are on 1.20 and I can't upgrade because of this. The worst thing is that the metrics suddenly disappeared in 1.19 with no warning and no mention in the release notes, and the impact on third-party utilities and platforms is huge. So should we use other alternatives, or will this be resolved? Regards

@kasbst

kasbst commented Dec 31, 2020

Is it possible to handle this with some urgency now? As already mentioned, Azure Monitor 'insights.container/nodes' metrics are broken for AKS 1.19.X+ because of this. Alerting is broken as well, and it blocks us from upgrading our eastus2 production AKS cluster. Thanks!

@frittentheke

@kasbst check out #95204 (comment)

@SergeyKanzhelev
Member

since #97006 is merged and issue was closed, let's close this PR. Thank you!

/close

Please re-open if you want the part exposing CPUTopologyMetrics.

@k8s-ci-robot
Contributor

@SergeyKanzhelev: Closed this PR.

In response to this:

since #97006 is merged and issue was closed, let's close this PR. Thank you!

/close

Please re-open if you want the part exposing CPUTopologyMetrics.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
