New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use prometheus conventions for workqueue metrics #71165
Comments
+1 |
Definitely no objection to adding a parallel set of metrics. I think we need to give 2 releases (?) where the existing metrics keep working, since these have been there for ages. |
Sounds good. |
/assign |
I have some reservations about this actually. This effectively circumvents one of the reasons we have metric whitelisting in prometheus-to-sd, specifically we use it to audit the number of metrics ingested (so we can ensure we don't exceed quota). Labelling effectively disguises the growth in cardinality for our metrics, which can potentially be problematic down the line. |
I understand the concern, but in this case the number of workqueues will be relatively low (I think we might be 20-30 now?) and we only have one label on the metrics. I would be more concerned if we had a high cardinality label and/or multiple labels. |
Whitelisting in prometheus-to-sd is used for different purpose: telling which metrics we should push to Stackdriver and which not. AFAIK 20-30 different label values is not critical at all for Stackdriver, I have seen worth. So I think it's ok, if that is something controlled in the code, and not autogenerated (for example node name). Also I don't think that any possible Stackdriver limitations should be considered. Those metrics are incorrectly designed from the Prometheus point of view and we should focus on fixing it. And last minor point we shouldn't use Summaries. Please replace it by Histogram (that's also recommendation from Prometheus: https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation read at the end). Summaries doesn't allow us to perform any aggregations between different queues and masters (for example in case of HA). |
Yes, PLEASE let's change Summary to Histogram. I care a lot about my DaemonSets, and really need good aggregation. |
What would you like to be added:
Workqueue metrics that follow prometheus best practices and naming conventions. So instead of naming metrics based on the name of the workqueue, create metrics for workqueues that use name as a label. Also use the recommended base units
Currently:
Proposal:
Why is this needed:
Controllers currently expose prometheus metrics for workqueues. Unfortunately the naming and the use of labels do not line up well with the prometheus recommendations. This leads to several metrics per named workqueue instead of a small number metrics where the name is used as a label. Also, the metrics does not use the recommended base units.
This change will allow for easier scraping of metrics, avoid large number of similarly named metrics and easier aggregation of metrics across different workqueues. The risk with this change is that the current metrics has been in Kubernetes for many releases and there might be users and tooling that depend on the current naming.
/kind feature
/sig api-machinery
/cc @loburm @kow3ns
The text was updated successfully, but these errors were encountered: