Closed
Description
In some scenarios, a client will stop pushing metrics because it has gone away. Currently, every metric has to be deleted explicitly or its last value will stick around forever. It would be good to be able to configure a timeout after which a metric is considered stale and removed. I think it would be best if the client could specify this.
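The requested behavior could be sketched as follows. This is a hypothetical illustration, not actual Pushgateway code: the `MetricStore` class, its method names, and the per-push `ttl_seconds` parameter are all assumptions made up for this sketch.

```python
import time

class MetricStore:
    """Hypothetical sketch of the proposed client-specified timeout:
    each pushed metric carries its own TTL, and stale entries are
    purged the next time the store is read."""

    def __init__(self):
        # name -> (value, pushed_at, ttl_seconds)
        self._metrics = {}

    def push(self, name, value, ttl_seconds=None):
        # ttl_seconds=None keeps today's behavior: the value never expires.
        self._metrics[name] = (value, time.time(), ttl_seconds)

    def collect(self):
        now = time.time()
        live = {}
        for name, (value, pushed_at, ttl) in self._metrics.items():
            if ttl is not None and now - pushed_at > ttl:
                continue  # older than its TTL: considered stale, dropped
            live[name] = value
        # Forget the expired entries so they do not linger in memory.
        self._metrics = {n: self._metrics[n] for n in live}
        return live
```

A client that has gone away simply stops refreshing its TTL, and its metrics disappear after the timeout instead of sticking around forever.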
Activity
brian-brazil commented on Feb 3, 2015
You may be interested in the textfile module of the node_exporter. It allows you to export information from the local filesystem, and since the data lives on the node, it goes away when the node does.
juliusv commented on Feb 3, 2015
@matthiasr Actually this is a great point by @brian-brazil. We should simply move chef-client exporting to the node exporter, since it can be considered a per-node metric. Then the time series will go away automatically if the host is gone.
brian-brazil commented on Feb 4, 2015
A use case for this has appeared: it may be a way to allow clients who really want to push to do so, while offering some GC.
matthiasr commented on Feb 4, 2015
@juliusv agree, that side-steps the issue in our case. But I think it's still something needed e.g. for cron jobs – an hourly cronjob may report to pushgateway, but after >1h that metric is no longer valid.
brian-brazil commented on Feb 4, 2015
@matthiasr Hourly cronjobs are service-level monitoring of batch jobs, which is the primary use case for the Pushgateway; you'd export that without any expiry, timestamps, or other advanced features like that.
matthiasr commented on Feb 4, 2015
Not necessarily … I may not be monitoring the job itself, but instead e.g. some complex value calculated by a Hadoop job.
But even when monitoring, say, the runtime of my cronjob, how would I tell whether it just always takes the same time or there hasn't been a run since? I'd rather have no metric if it didn't run than the metric from the last time it ran. At least in some cases, which is why I think it should be optional.
juliusv commented on Feb 4, 2015
@matthiasr To expand on what Brian said, an hourly cronjob would push its last completion timestamp to the pushgateway. That way you can monitor (via
time() - last_successful_run_timestamp_seconds
) whether your batch job hasn't run for too long. The metric would still be ingested by Prometheus upon every scrape from the pushgateway and get a server-side current timestamp attached.
juliusv commented on Feb 4, 2015
See also http://prometheus.io/docs/practices/instrumentation/#batch-jobs
brian-brazil commented on Feb 4, 2015
Also, http://prometheus.io/docs/instrumenting/pushing/#java-batch-job-example
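The pattern described above — pushing a last-success timestamp instead of relying on metric expiry — can be sketched using only the standard library. The metric name, gateway address, and helper functions here are illustrative assumptions; a real client would normally use one of the official Prometheus client libraries rather than hand-building the text exposition format.

```python
import time
import urllib.request

def build_payload(metric, value):
    # Minimal Prometheus text exposition format: one sample per line,
    # terminated by a newline.
    return f"{metric} {value}\n"

def push_last_success(gateway, job):
    """Record 'now' as the batch job's last successful completion time.

    The gateway URL and job name are example values; the Pushgateway
    replaces all metrics in the job's group on a PUT to this path.
    """
    payload = build_payload(
        "my_batch_job_last_success_timestamp_seconds",  # example metric name
        time.time(),
    )
    req = urllib.request.Request(
        f"{gateway}/metrics/job/{job}",
        data=payload.encode(),
        method="PUT",
    )
    urllib.request.urlopen(req)
```

On the Prometheus side, an alert like `time() - my_batch_job_last_success_timestamp_seconds > 3600` would then fire when the hourly job has not completed recently, without any metric ever needing to expire.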
Implement lifetime feature