
Use prometheus conventions for workqueue metrics #71165

Closed
mortent opened this issue Nov 16, 2018 · 8 comments · Fixed by #71300

@mortent
Member

mortent commented Nov 16, 2018

What would you like to be added:
Workqueue metrics that follow Prometheus best practices and naming conventions. Instead of naming metrics after each workqueue, create a shared set of workqueue metrics that carry the queue name as a label. Also use the recommended base units.

Currently:

  • daemonset_depth gauge
  • daemonset_queue_latency summary (in milliseconds)
  • daemonset_retries counter
  • daemonset_work_duration summary (in milliseconds)

Proposal:

  • workqueue_depth gauge name="deployment|daemonset|job.."
  • workqueue_queue_latency_seconds summary name="deployment|daemonset|job.." (in seconds)
  • workqueue_retries_total counter name="deployment|daemonset|job.."
  • workqueue_work_duration_seconds summary name="deployment|daemonset|job.." (in seconds)
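For illustration, here is a minimal sketch of the proposed shape using the prometheus/client_golang API. This is not the actual Kubernetes workqueue metrics provider; the help strings, bucket of example calls, and example label values are assumptions made for the sketch.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// One gauge family for all workqueues; the queue name is a label value.
	depth = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "workqueue_depth",
		Help: "Current depth of the workqueue.",
	}, []string{"name"})

	// Counter name ends in _total, per the Prometheus naming conventions.
	retries = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "workqueue_retries_total",
		Help: "Total number of retries handled by the workqueue.",
	}, []string{"name"})

	// Durations are reported in seconds, the recommended base unit.
	queueLatency = prometheus.NewSummaryVec(prometheus.SummaryOpts{
		Name: "workqueue_queue_latency_seconds",
		Help: "How long an item stays in the workqueue before being processed, in seconds.",
	}, []string{"name"})
)

func main() {
	prometheus.MustRegister(depth, retries, queueLatency)

	// Every controller reports into the same metric families,
	// distinguished only by the "name" label.
	depth.WithLabelValues("daemonset").Set(3)
	retries.WithLabelValues("deployment").Inc()
	queueLatency.WithLabelValues("job").Observe(0.042)
}
```

A scrape then sees a single workqueue_depth family with one series per queue, so aggregating across queues becomes a simple sum by (name).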

Why is this needed:
Controllers currently expose Prometheus metrics for their workqueues. Unfortunately, the naming and the use of labels do not line up well with the Prometheus recommendations. This leads to several metrics per named workqueue instead of a small number of metrics where the queue name is used as a label. The metrics also do not use the recommended base units.

This change will allow for easier scraping of metrics, avoid a large number of similarly named metrics, and make it easier to aggregate metrics across different workqueues. The risk with this change is that the current metrics have been in Kubernetes for many releases, and there may be users and tooling that depend on the current naming.

/kind feature
/sig api-machinery
/cc @loburm @kow3ns

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Nov 16, 2018
@danielqsj
Contributor

+1
I can help if we decide to do it.

@lavalamp
Member

Definitely no objection to adding a parallel set of metrics.

I think we need to give 2 releases (?) where the existing metrics keep working, since these have been there for ages.

@mortent
Member Author

mortent commented Nov 19, 2018

Sounds good.
@danielqsj If you want to tackle this, just assign this issue to yourself. I'm happy to help reviewing when you have something.

@danielqsj
Contributor

/assign

@logicalhan
Member

I have some reservations about this, actually. It effectively circumvents one of the reasons we have metric whitelisting in prometheus-to-sd: specifically, we use it to audit the number of metrics ingested (so we can ensure we don't exceed quota). Labelling effectively disguises the growth in cardinality of our metrics, which can potentially be problematic down the line.

@mortent
Member Author

mortent commented Dec 7, 2018

I understand the concern, but in this case the number of workqueues will be relatively low (I think we might have 20-30 now?) and we only have one label on the metrics. I would be more concerned if we had a high-cardinality label and/or multiple labels.
@loburm You have any guidance here?

@loburm
Contributor

loburm commented Dec 7, 2018

Whitelisting in prometheus-to-sd serves a different purpose: deciding which metrics we should push to Stackdriver and which we should not. AFAIK 20-30 different label values is not critical at all for Stackdriver; I have seen worse. So I think it's OK, as long as the label values are controlled in the code and not autogenerated (for example, from node names).

Also, I don't think any potential Stackdriver limitations should be a consideration here. These metrics are incorrectly designed from the Prometheus point of view, and we should focus on fixing that.

And one last minor point: we shouldn't use Summaries. Please replace them with Histograms (that is also the recommendation from Prometheus: https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation, see the end of the page). Summaries don't allow us to perform any aggregations across different queues and masters (for example, in an HA setup).
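To make that concrete, here is a minimal sketch, again assuming client_golang, of one of the duration metrics defined as a HistogramVec rather than a SummaryVec. The bucket boundaries are illustrative assumptions only, not values any implementation would be required to use.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Histogram buckets are exported per series, so quantiles can be computed
// server-side with histogram_quantile() and bucket counts can be summed
// across queues and masters before the quantile is calculated.
var workDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "workqueue_work_duration_seconds",
	Help: "How long processing an item from the workqueue takes, in seconds.",
	// Illustrative buckets only: 10 exponential buckets starting at 1ms.
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
}, []string{"name"})

func main() {
	prometheus.MustRegister(workDuration)
	workDuration.WithLabelValues("daemonset").Observe(0.0017)
}
```

With summaries, the precomputed quantiles from different queues or apiserver instances cannot be meaningfully merged; with histograms, the bucket counts aggregate cleanly.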

@MikeSpreitzer
Member

Yes, PLEASE let's change Summary to Histogram. I care a lot about my DaemonSets, and really need good aggregation.
