The ingester readiness endpoint fails on ingester startup if there's an unhealthy ingester within the ring. This seems to create some confusion for users (e.g. #2913), and I'm also not sure this logic makes sense when running the Cortex chunks storage with WAL or the Cortex blocks storage.
I'm opening this PR to have a discussion about it. In particular:
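To make the behaviour under discussion concrete, here is a minimal, self-contained sketch. The types and the ringReady helper are invented for illustration and are not the actual Cortex lifecycler code; it only shows why a freshly rescheduled, healthy ingester can still fail its readiness check: the check walks every entry in the ring, so a stale entry left behind by a crashed pod keeps /ready failing on all instances.

package main

import (
	"fmt"
	"time"
)

// instanceDesc is an invented stand-in for a ring entry, used only for this
// illustration; it is not the real Cortex ring type.
type instanceDesc struct {
	state         string    // e.g. "ACTIVE", "LEAVING", "PENDING"
	lastHeartbeat time.Time // last time this instance heartbeated the ring
}

// ringReady mimics the shape of the readiness check being discussed: it fails
// if ANY entry in the ring is not ACTIVE or has stopped heartbeating,
// regardless of which instance is actually serving the /ready request.
func ringReady(ring map[string]instanceDesc, heartbeatTimeout time.Duration, now time.Time) error {
	for id, inst := range ring {
		if inst.state != "ACTIVE" || now.Sub(inst.lastHeartbeat) > heartbeatTimeout {
			return fmt.Errorf("instance %s (state %s) is unhealthy: not ACTIVE or past heartbeat timeout", id, inst.state)
		}
	}
	return nil
}

func main() {
	now := time.Now()
	ring := map[string]instanceDesc{
		"ingester-new": {state: "ACTIVE", lastHeartbeat: now},                        // freshly rescheduled pod
		"ingester-old": {state: "ACTIVE", lastHeartbeat: now.Add(-10 * time.Minute)}, // stale entry left by the crashed pod
	}
	// The new ingester is perfectly healthy, yet its /ready check still fails
	// because of the stale entry it can do nothing about.
	fmt.Println(ringReady(ring, time.Minute, now))
}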
Activity
pstibrany commented on Aug 14, 2020
My understanding is that it is used to stop a rollout in case of any problems in the ring and to make the operator investigate and fix them (or just "forget" the bad entry).
@tomwilkie will surely know the reasoning here.
bboreham commented on Aug 14, 2020
I also believe it is to halt a rolling update.
Related: #1521
slim-bean commented on Aug 14, 2020
I recently added #2936 to help users diagnose this problem. It doesn't address the question of changing the behavior, but it should make it easier for operators to diagnose and recover.
pracucci commented on Aug 19, 2020
Yes, it's a starting point. However, I think the main use case for people is having Cortex self-heal. If you run ingesters as Kubernetes Deployments (so each new ingester will have a random ID) and you lose just 1 ingester, its pod will be rescheduled by Kubernetes but will not join the ring, because the previous one (which didn't cleanly shut down) is unhealthy within the ring.
bboreham commented on Aug 19, 2020
I run ingesters in a Deployment and I don't think I've observed that.
andrewl3wis commented on Aug 19, 2020
They will join the ring, but the pod readiness health check sees that the ring is unhealthy, which leads the pod that joined to replace the failed one to restart as well.
bboreham commented on Aug 20, 2020
Not seen that either.
Can you show log files where a pod decides to restart because it is not ready?
bboreham commented on Sep 10, 2020
Today I see that a message was added as part of the gossiping changes:
level=warn ts=2020-09-09T22:14:14.839462614Z caller=lifecycler.go:230 msg="found an existing instance(s) with a problem in the ring, this instance cannot complete joining and become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance ingester-5c45496986-k74wz past heartbeat timeout"
which is lying, because the ingester printing this message is ACTIVE in the ring (although it is not ready).
pstibrany commented on Sep 10, 2020
It was actually added in #2936, because this state is confusing for people.
It is perhaps not very precise, but what it's trying to say is that the ingester cannot report "ready" via the /ready handler, because there are other ingesters in the ring that are not ACTIVE.
bboreham commented on Sep 10, 2020
To clarify my complaint:
"this instance cannot become ready until this problem is resolved" would be ok.
"this instance cannot complete joining" is incorrect: the ingester has joined the ring and is in ACTIVE state.
slim-bean commented on Sep 10, 2020
@bboreham I created #3158 to correct the message.
stale commented on Nov 9, 2020
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
22 remaining items
slim-bean commented on Jan 3, 2022
You can use the ring http endpoint to "forget" unhealthy entries and not have to scale to zero and back up.
Cortex and Loki both now have a configuration parameter to "auto forget" unhealthy ring entries, too.
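Purely as an illustration of what such an "auto forget" amounts to (an invented sketch reusing the imports from the sample in the PR description, not the actual Loki or Cortex code, and the ten-timeouts threshold is an assumption): a periodic sweep removes ring entries whose heartbeat is far past the timeout, so the remaining ingesters can become ready again without anyone manually forgetting entries via the /ring page.

// autoForget is an invented sketch, not the real implementation: given the
// last heartbeat time of each ring entry, drop entries that have missed many
// heartbeat periods so the surviving ingesters can pass the readiness check.
func autoForget(lastHeartbeat map[string]time.Time, heartbeatTimeout time.Duration, now time.Time) []string {
	var forgotten []string
	for id, hb := range lastHeartbeat {
		if now.Sub(hb) > 10*heartbeatTimeout { // assumption: forget after ~10 missed timeouts
			delete(lastHeartbeat, id) // deleting while ranging over a map is safe in Go
			forgotten = append(forgotten, id)
		}
	}
	return forgotten
}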
bboreham commented on Jan 4, 2022
I believe Cortex has this for store-gateway, ruler and alertmanager. Not ingester.
bboreham commented on Jan 4, 2022
@zetaab your logs say there were two pods not heartbeating. No reason to connect that symptom with scaling.
ChrisSteinbach commented on Jan 17, 2022
We also have this problem: the ingester set doesn't come back up because of a catch-22 where the first ingester doesn't report readiness since it can't contact the others in the set, and the others are not scheduled to start until the first instance reports readiness.
I think this is the same issue reported here, and the issue is closed. Is that because there is a fix, or is the manual removal of instances from the ring deemed sufficient or something else? Just trying to get a grip on whether it needs to be pushed further.
sharathfeb12 commented on Jun 2, 2022
Seeing the same issue on v2.5.0. Any help/pointers to get around this? It's biting us hard in prod :(
alongfield commented on Jul 26, 2022
Having exactly the same issue with 2.6.1. A single ingester got OOMKilled and brought the entire ingestion pipeline down. I couldn't even get a size=1 ingester ring to come up... it just flaps between PENDING, ACTIVE, and LEAVING, the same as a much larger one. To say the least, it is not building confidence in the ability to keep this thing stable enough to use in prod... there aren't even any docs for what to do about this.
rmn-lux commented on Oct 3, 2022
maybe this will help
paul-bormans commented on Oct 17, 2022
Seeing the same issue on v1.13.1. To be honest I'm not sure when/how exactly the ingester pods "died", but let's assume OOMKilled. The thing is, the standard recovery pattern by K8s doesn't work, since the new ingesters don't report ready while the old instances are still in the ring.
AkakievKD commented on Sep 30, 2023
@rmn-lux this helped, thanks
pranav-e6x commented on Dec 30, 2023
facing the same issue, is it resolved?
vfzh commented on Feb 22, 2024
facing the same issue, is it resolved?
illthizam-healthhelper commented on Jun 23, 2024
still facing this issue
level=warn ts=2024-06-23T15:14:04.57261201Z caller=lifecycler.go:291 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 192.168.116.137:9095 past heartbeat timeout"
CharlieTLe commented on Jun 23, 2024
There's a flag that might make this better for you:
# When enabled the readiness probe succeeds only after all instances are
# ACTIVE and healthy in the ring, otherwise only the instance itself is
# checked. This option should be disabled if in your cluster multiple
# instances can be rolled out simultaneously, otherwise rolling updates may be
# slowed down.
# CLI flag: -ingester.readiness-check-ring-health
[readiness_check_ring_health: <boolean> | default = true]
You can try setting it to false.
Also, disabling the heartbeat on the ingester may be useful, because the distributor has its own health check for the ingester, so it's somewhat redundant.
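To make the effect of that option concrete, here is a small invented sketch (reusing the illustrative ringReady and instanceDesc helpers from the sample in the PR description; it is not the actual lifecycler code): with readiness_check_ring_health set to false, readiness depends only on the instance's own ring entry rather than on every entry in the ring.

// checkReady sketches the branch that the flag controls. This is an
// illustration only, built on the invented helpers above.
func checkReady(checkRingHealth bool, myID string, ring map[string]instanceDesc,
	heartbeatTimeout time.Duration, now time.Time) error {
	if checkRingHealth {
		// Default behaviour: fail if ANY ring entry is unhealthy.
		return ringReady(ring, heartbeatTimeout, now)
	}
	// With the option disabled, only this instance's own entry is considered.
	me, ok := ring[myID]
	if !ok || me.state != "ACTIVE" {
		return fmt.Errorf("instance %s is not ACTIVE in the ring", myID)
	}
	return nil
}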
yongzhang commented on Jul 9, 2025
I have the same issue when enabling the pattern ingester; is the solution the same as for the ingester?