New ingesters not ready if there's a faulty ingester in the ring #3040
My understanding is that it is used to stop a rollout in case of any problems in the ring and make the operator investigate and fix (or just "forget" the bad entry). @tomwilkie will surely know the reasoning here. |
I also believe it is to halt a rolling update. Related: #1521 |
I recently added #2936 to help users diagnose this problem. It doesn't address the question of changing the behavior, but it should make it easier for operators to diagnose and recover. |
Yes, it's a starting point. However, I think the main use case for people is having Cortex self-heal. If you run Kubernetes Deployments (so each new ingester gets a random ID) and you lose just 1 ingester, its pod will be rescheduled by Kubernetes but will not join the ring, because the previous one (which didn't cleanly shut down) is unhealthy within the ring. |
I run ingesters in a Deployment and I don't think I've observed that. |
They will join the ring, but the healthcheck for pod readiness senses that the ring is unhealthy, which causes the pod that joined to replace the failed one to restart as well. |
Not seen that either. |
Today I see a message was added as part of the gossiping changes:
which is lying because the ingester printing this message is ACTIVE in the ring (although it is not ready). |
It was actually added in #2936, because this state is confusing for people. It is perhaps not very precise, but what it's trying to say is that the ingester cannot report "ready" via the /ready handler, because there are other ingesters in the ring that are not ACTIVE. |
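For reference, here is a simplified sketch (not the actual Cortex lifecycler code) of the kind of check the /ready handler performs; the `Instance` and `Ring` types, addresses, and heartbeat timeout below are illustrative placeholders only:

```go
// Simplified illustration of the readiness behaviour discussed in this issue:
// a single stale ring entry (e.g. an ingester that was OOMKilled and never
// unregistered) keeps a new ingester from reporting ready, even though the
// new ingester itself is ACTIVE.
package main

import (
	"fmt"
	"time"
)

type Instance struct {
	Addr          string
	State         string // "ACTIVE", "PENDING", "LEAVING", ...
	LastHeartbeat time.Time
}

type Ring struct {
	Instances []Instance
}

// checkReady fails if ANY instance in the ring is not ACTIVE or has a stale heartbeat.
func checkReady(r Ring, heartbeatTimeout time.Duration, now time.Time) error {
	for _, inst := range r.Instances {
		if inst.State != "ACTIVE" {
			return fmt.Errorf("instance %s is in state %s", inst.Addr, inst.State)
		}
		if now.Sub(inst.LastHeartbeat) > heartbeatTimeout {
			return fmt.Errorf("instance %s past heartbeat timeout", inst.Addr)
		}
	}
	return nil
}

func main() {
	ring := Ring{Instances: []Instance{
		{Addr: "10.0.0.1:9095", State: "ACTIVE", LastHeartbeat: time.Now()},
		// Stale entry left behind by an unclean shutdown.
		{Addr: "10.0.0.2:9095", State: "ACTIVE", LastHeartbeat: time.Now().Add(-10 * time.Minute)},
	}}
	if err := checkReady(ring, time.Minute, time.Now()); err != nil {
		fmt.Println("not ready:", err) // the /ready endpoint would return non-200 here
	}
}
```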
To clarify my complaint: |
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions. |
AFAIK the problem still exists. |
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions. |
Still valid |
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions. |
Still valid |
Still valid. |
Did we find any resolution for this? |
I don't think anything changed: the behaviour was added to halt rolling updates and we have not found a better approach. However, you remind me: I wondered if we could add a Having read a bit more, I don't think this would materially change things; it would just move the discussion from "why isn't it ready?" to "why hasn't it started?" |
I think we should work on solutions which work outside of K8S too. We're aware of companies running Cortex on-premise without K8S. |
I am still facing the same issue |
@Rahuly360 you can read previous discussion of your question at #1521.
It does. This behaviour is very specifically a signal to auto-deployment mechanisms not to start the next one. |
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions. |
Still valid. This is a major pain point for us currently, because ingesters can often get shut down uncleanly on Kubernetes for a multitude of reasons, and it seems like the chore of going to the distributor web UI and clicking "forget" could easily be automated away. |
It seems like ingesters need to be started one by one; if we just scale up 3 pods from 0, it cannot handle the situation. So instead of scaling to 3, I need to scale first to 1 and wait for it to start, then scale up more pods. |
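That one-at-a-time workaround can be scripted. A rough sketch, shelling out to kubectl, assuming a StatefulSet named `ingester` in a `cortex` namespace (both names are placeholders, and the commenter's actual workload type isn't stated):

```go
// Scale an ingester StatefulSet up one replica at a time, waiting for each
// new pod to become ready before adding the next one.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func run(args ...string) {
	cmd := exec.Command("kubectl", args...)
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Fatalf("kubectl %v failed: %v\n%s", args, err, out)
	}
	fmt.Printf("%s", out)
}

func main() {
	const target = 3
	for replicas := 1; replicas <= target; replicas++ {
		// Scale up by one replica.
		run("-n", "cortex", "scale", "statefulset/ingester",
			fmt.Sprintf("--replicas=%d", replicas))
		// Wait for the newest pod to pass its readiness check before
		// adding the next one.
		run("-n", "cortex", "wait", "--for=condition=ready",
			fmt.Sprintf("pod/ingester-%d", replicas-1), "--timeout=5m")
	}
}
```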
what is the resolution for this? |
You can use the ring http endpoint to "forget" unhealthy entries and not have to scale to zero and back up. Both Cortex and Loki now have a configuration parameter to "auto forget" unhealthy ring entries too. |
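For example, a minimal sketch of automating that "forget" action; the ring URL and instance ID are placeholders, and the `forget` form field is based on the button on the ring status page, so verify it against your running version before relying on it:

```go
// Submit the same form POST that the "Forget" button on the ring page sends,
// removing a stale entry without scaling the ingesters down and back up.
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
)

func main() {
	ringURL := "http://distributor:8080/ring" // placeholder address/path
	instanceID := "ingester-2"                // the unhealthy ring entry to forget

	resp, err := http.PostForm(ringURL, url.Values{"forget": {instanceID}})
	if err != nil {
		log.Fatalf("forget request failed: %v", err)
	}
	defer resp.Body.Close()
	fmt.Println("forget response status:", resp.Status)
}
```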
I believe Cortex has this for store-gateway, ruler and alertmanager. Not ingester. |
@zetaab your logs say there were two pods not heartbeating. No reason to connect that symptom with scaling. |
We also have this problem where the ingester set doesn't come back up because of this catch-22 where the first ingester doesn't report readiness since it can't contact the others in the set, and the others are not scheduled to start until the first instance reports readiness. I think this is the same issue reported here, and the issue is closed. Is that because there is a fix, or is the manual removal of instances from the ring deemed sufficient or something else? Just trying to get a grip on whether it needs to be pushed further. |
Seeing same issue on v2.5.0. Any help/pointers to get around this? Its biting us hard in prod :( |
Having exactly the same issue with 2.6.1. A single ingester got OOMKilled and brought the entire ingestion pipeline down. I couldn't even get a size=1 ingester ring to come up... it just flaps between PENDING, ACTIVE, and LEAVING, the same as a much larger one. To say the least, it is not building confidence in the ability to keep this thing stable enough to use in prod... there aren't even any docs for what to do about this. |
Maybe this will help: |
Seeing the same issue on v1.13.1; to be honest I'm not sure when/how exactly the ingester pods "died", but let's assume OOMKilled. The thing is: the standard recovery pattern by K8s doesn't work, since the new ingesters don't report ready because the old instances are still in the ring. |
@rmn-lux this helped, thanks. |
facing the same issue, is it resolved? |
Still facing this issue:
level=warn ts=2024-06-23T15:14:04.57261201Z caller=lifecycler.go:291 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 192.168.116.137:9095 past heartbeat timeout" |
There's a flag that might make this better for you:
You can try setting it to false. Also, disabling the heartbeat on the ingester may be useful, because the distributor has its own health check to the ingester, so it's kind of redundant. |
The ingester readiness endpoint fails on ingester startup if there's an unhealthy ingester within the ring. This looks to create some confusion for users (e.g. #2913), and I'm also not sure this logic makes sense when running the Cortex chunks storage with WAL or the Cortex blocks storage.
I'm opening this PR to have a discussion about it. In particular: