Add cadvisor machine metrics #95210
Conversation
Hi @erikwilson. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: erikwilson. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/ok-to-test
LGTM, thanks for adding this!
/test pull-kubernetes-node-e2e
I've tried a build of v1.19.2 with this fix, but the machine metrics still become stale shortly after the kubelet starts. I'm not sure why the machine info metrics needed to be timestamped, but this seems to break how the kubelet cached them until now. Unless the Prometheus config disables `honor_timestamps`, the stale timestamps cause the samples to be dropped.
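As noted above, a scrape-side mitigation is to have Prometheus ignore the timestamps the kubelet exposes. `honor_timestamps` is a real `scrape_config` option; the job name below is just an illustrative placeholder, not from this thread:

```
scrape_configs:
  - job_name: kubelet-cadvisor      # example job name (hypothetical)
    honor_timestamps: false         # use scrape time instead of exposed timestamps
    metrics_path: /metrics/cadvisor
    scheme: https
```

With `honor_timestamps: false`, Prometheus stamps samples with the scrape time, so cached machine-info metrics no longer go stale between kubelet restarts.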
Thanks for pointing that out @basvdlei. It looks like that is run as part of the manager; still trying to grok why this code is different or how it relates to other cadvisor code in the kubelet, like here:
I checked and this has not been resolved in the 1.19.3 release, since this PR is not ready. So I had to stay on kubelet 1.18.9 while using kubeadm and kubectl 1.19.3. Is there a better workaround for this? Or, in this case, running cAdvisor as a separate DaemonSet? I don't know if that last option is good practice.
/assign @bobbypage |
A tricky fix is to change the cached machine info timestamp when getting it:

kubernetes/pkg/kubelet/kubelet_getters.go Lines 383 to 388 in 425fb7e

To:

```go
// GetCachedMachineInfo assumes that the machine info can't change without a reboot
func (kl *Kubelet) GetCachedMachineInfo() (*cadvisorapiv1.MachineInfo, error) {
	kl.machineInfoLock.RLock()
	defer kl.machineInfoLock.RUnlock()
	clone := kl.machineInfo.Clone()
	clone.Timestamp = time.Now() // refresh timestamp so the exposed metrics are not treated as stale
	return clone, nil
}
```
```diff
@@ -357,6 +357,7 @@ func (s *Server) InstallDefaultHandlers(enableCAdvisorJSONEndpoints bool) {
 	cadvisormetrics.NetworkUsageMetrics: struct{}{},
 	cadvisormetrics.AppMetrics:          struct{}{},
 	cadvisormetrics.ProcessMetrics:      struct{}{},
+	cadvisormetrics.CPUTopologyMetrics:  struct{}{},
```
This is adding metrics that weren't originally present, right? Since we will want to cherry-pick this change, we shouldn't introduce new metrics here.
@basvdlei Do you mind trying this PR with the fix in this comment? I haven't tested it yet, but I believe it should solve the problem you described.
@lingsamuel sorry for the late response. I made another build with your patch, and as far as I can tell it works, at least for the metrics I use. Updating the timestamp on scrape should prevent the metrics from becoming stale. I was thinking about this though, and it does feel a bit like a workaround. Since the timestamps have no meaning on the cached metrics, I'm wondering if there is some way we can drop them altogether.
Yes, it definitely is a temporary workaround, and I am not sure it can be merged into the k8s codebase.
I took a little more time to look into this. The collector will only include a timestamp when it's non-zero. So it's just a small fix to get the old behavior back, by setting the timestamp to zero before caching the machineInfo:

```diff
diff --git a/pkg/kubelet/kubelet.go b/pkg/kubelet/kubelet.go
index 2fdd321e583..7ccb28daad2 100644
--- a/pkg/kubelet/kubelet.go
+++ b/pkg/kubelet/kubelet.go
@@ -561,6 +561,7 @@ func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
 	if err != nil {
 		return nil, err
 	}
+	machineInfo.Timestamp = time.Time{}
 	klet.setCachedMachineInfo(machineInfo)
 	imageBackOff := flowcontrol.NewBackOff(backOffPeriod, MaxContainerBackOff)
```
Thanks for making this change. This issue has caused the node metric collection for Azure Monitor for Containers to be broken, which also impacts alerting. When can we expect the fix to be backported and available?
I agree. We are on 1.20 and I can't upgrade because of this. The worst thing is that the metrics suddenly disappeared in 1.19 with no warning or release-notes mention, with a huge impact on third-party utilities and platforms. Should we look for alternatives, or will this be resolved? Regards.
Is it possible to handle this with some urgency now? As already mentioned, Azure Monitor 'insights.container/nodes' metrics are broken for AKS 1.19.x+ because of this. Alerting is broken as well, and it blocks us from upgrading our eastus2 production AKS cluster. Thanks!
@kasbst check out #95204 (comment)
Since #97006 is merged and the issue was closed, let's close this PR. Thank you!
/close
Please re-open if you want the part exposing `CPUTopologyMetrics` merged.
@SergeyKanzhelev: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Adds Prometheus cadvisor machine metrics back into mainline k8s (should be backported to 1.19)
Which issue(s) this PR fixes:
Fixes #95204
Special notes for your reviewer:
Assumes we want `CPUTopologyMetrics`
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: