New ingesters not ready if there's a faulty ingester in the ring #3040

Closed
pracucci opened this issue Aug 14, 2020 · 40 comments

@pracucci
Contributor

The ingester readiness endpoint fails on ingester startup if there's an unhealthy ingester within the ring. This seems to create some confusion for users (e.g. #2913), and I'm also not sure this logic makes sense when running the Cortex chunks storage with WAL or the Cortex blocks storage.

I'm opening this issue to have a discussion about it. In particular:

  1. Why was this check introduced?
  2. What would happen if we remove it?
@pstibrany
Contributor

My understanding is that it is used to stop a rollout in case of any problems in the ring and make the operator investigate and fix them (or just "forget" the bad entry).

@tomwilkie will surely know the reasoning here.

@bboreham
Contributor

I also believe it is to halt a rolling update.

Related: #1521

@slim-bean
Contributor

I recently added #2936 to help users diagnose this problem. It doesn't address the question of changing the behavior, but it should make it easier for operators to diagnose and recover.

@pracucci
Contributor Author

I recently added #2936 to help users diagnose this problem. It doesn't address the question of changing the behavior, but it should make it easier for operators to diagnose and recover.

Yes, it's a starting point. However, I think the main use case for people is having Cortex self-heal. If you run ingesters as Kubernetes Deployments (so each new ingester will have a random ID) and you just lose 1 ingester, its pod will be rescheduled by Kubernetes but will not join the ring because the previous one (which didn't shut down cleanly) is unhealthy within the ring.

@bboreham
Contributor

will not join the ring

I run ingesters in a Deployment and I don't think I've observed that.

@andrewl3wis

will not join the ring

I run ingesters in a Deployment and I don't think I've observed that.

They will join the ring, but the health check for pod readiness senses that the ring is unhealthy, which causes the pod that joined to replace the failed one to restart as well.

@bboreham
Contributor

Not seen that either.
Can you show log files where a pod decides to restart because it is not ready?

@bboreham
Contributor

Today I see a message was added as part of the gossiping changes:

level=warn ts=2020-09-09T22:14:14.839462614Z caller=lifecycler.go:230 msg="found an existing instance(s) with a problem in the ring, this instance cannot complete joining and become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance ingester-5c45496986-k74wz past heartbeat timeout"

which is lying because the ingester printing this message is ACTIVE in the ring (although it is not ready).

@pstibrany
Contributor

pstibrany commented Sep 10, 2020

Today I see a message was added as part of the gossiping changes:

It was actually added in #2936, because this state is confusing for people.

It is perhaps not very precise, but what it's trying to say is that the ingester cannot report "ready" via the /ready handler, because there are other ingesters in the ring that are not ACTIVE.

@bboreham
Contributor

To clarify my complaint:
"this instance cannot become ready until this problem is resolved" would be ok.
"this instance cannot complete joining" is incorrect: the ingester has joined the ring and is in ACTIVE state.

@slim-bean
Contributor

@bboreham I created #3158 to correct the message.

@stale

stale bot commented Nov 9, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 9, 2020
@eraac

eraac commented Nov 10, 2020

AFAIK the problem still exists.

@stale stale bot removed the stale label Nov 10, 2020
@stale

stale bot commented Jan 9, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jan 9, 2021
@pracucci
Contributor Author

Still valid

@stale stale bot removed the stale label Jan 11, 2021
@stale

stale bot commented Apr 11, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 11, 2021
@bboreham
Contributor

Still valid

@stale stale bot removed the stale label Apr 12, 2021
@aleksanderllada

Still valid.

@vikrantsde

Did we find any resolution for this?

@bboreham
Contributor

bboreham commented Jul 7, 2021

I don't think anything changed: the behaviour was added to halt rolling updates and we have not found a better approach.

However you remind me: I wondered if we could add a /startup handler intended to be used with Kubernetes startupProbe, and simplify the /ready behaviour. StartupProbe was added as Alpha in K8s 1.16 and on by default from 1.18.

Having read a bit more, I don't think this would materially change things, just move the discussion from "why isn't it ready" to "why hasn't it started"?
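
For illustration, here is roughly what the Kubernetes side of that idea could look like. This is only a sketch: the /startup handler is hypothetical (it doesn't exist in Cortex today), and the image tag, port and thresholds are placeholders.

apiVersion: v1
kind: Pod
metadata:
  name: ingester
spec:
  containers:
    - name: ingester
      image: cortexproject/cortex:v1.9.0   # placeholder version
      args: ["-target=ingester"]
      # Other probes are disabled until the startup probe succeeds.
      startupProbe:
        httpGet:
          path: /startup        # hypothetical handler proposed above
          port: 80
        failureThreshold: 30    # allow up to ~5 minutes for startup
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 80
        periodSeconds: 10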

@pracucci
Contributor Author

pracucci commented Jul 8, 2021

However you remind me: I wondered if we could add a /startup handler intended to be used with Kubernetes startupProbe, and simplify the /ready behaviour. StartupProbe was added as Alpha in K8s 1.16 and on by default from 1.18.

I think we should work on solutions which work outside of K8S too. We're aware of companies running Cortex on-premise without K8S.

@Rahuly360

I am still facing the same issue:
msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance cortex-ingester-7577dd5555-cgqqt past heartbeat timeout"
This issue is only resolved when we forget the unhealthy ingester from the ring on the distributor in Cortex.
But that's not a good solution. Ingester should start automatically.
Has anyone found a solution for auto-healing the ingester in Cortex?

@bboreham
Contributor

bboreham commented Jul 27, 2021

@Rahuly360 you can read previous discussion of your question at #1521.

Ingester should start automatically.

It does. This behaviour is very specifically a signal to auto-deployment mechanisms not to start the next one.
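
To make that concrete, this is one way the signal is typically consumed on Kubernetes; a sketch only, with placeholder names, image tag and port. With a readiness probe on /ready and a conservative rolling-update strategy, the rollout stalls on the unready ingester instead of moving on to replace the next one.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingester
spec:
  replicas: 3
  selector:
    matchLabels:
      name: ingester
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1   # take down at most one ingester at a time
  template:
    metadata:
      labels:
        name: ingester
    spec:
      containers:
        - name: ingester
          image: cortexproject/cortex:v1.9.0   # placeholder version
          args: ["-target=ingester"]
          readinessProbe:
            httpGet:
              path: /ready   # stays unready while any ring member is unhealthy
              port: 80
            periodSeconds: 10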

@stale

stale bot commented Oct 26, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 26, 2021
@stale stale bot closed this as completed Nov 11, 2021
@ebr

ebr commented Nov 11, 2021

Still valid. This is a major pain point for us currently because ingesters can often get shut down uncleanly on Kubernetes for a multitude of reasons, and it seems like the chore of going to the distributor web UI and clicking "forget" could easily be automated away.

@zetaab

zetaab commented Dec 7, 2021

It seems like ingesters need to be started one by one; if we just scale 3 pods up from 0 it cannot handle the situation. So instead of scaling to 3, I need to scale to 1 first, wait for it to start, and then scale up more pods.

level=info ts=2021-12-07T12:16:19.117292027Z caller=lifecycler.go:754 msg="changing instance state from" old_state=JOINING new_state=ACTIVE ring=ingester
level=warn ts=2021-12-07T12:18:28.175915362Z caller=lifecycler.go:237 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance cortex-ingester-7444b9cb6d-lczgp past heartbeat timeout"
level=warn ts=2021-12-07T12:18:58.173087971Z caller=lifecycler.go:237 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance cortex-ingester-7444b9cb6d-lczgp past heartbeat timeout"
level=warn ts=2021-12-07T12:19:28.172621093Z caller=lifecycler.go:237 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance cortex-ingester-7444b9cb6d-lczgp past heartbeat timeout"
level=warn ts=2021-12-07T12:19:58.173667613Z caller=lifecycler.go:237 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance cortex-ingester-7444b9cb6d-t84gt past heartbeat timeout"

@JohnMops

JohnMops commented Jan 2, 2022

What is the resolution for this?
If Loki fails, then it has to be manually scaled to 0 and spun up 1 by 1.

@slim-bean
Contributor

slim-bean commented Jan 3, 2022

You can use the ring http endpoint to "forget" unhealthy entries and not have to scale to zero and back up.

Both Cortex and Loki now have a configuration parameter to "auto forget" unhealthy ring entries too.

@bboreham
Contributor

bboreham commented Jan 4, 2022

Both Cortex and Loki now have a configuration parameter to "auto forget" unhealthy ring entries too.

I believe Cortex has this for store-gateway, ruler and alertmanager. Not ingester.

@bboreham
Contributor

bboreham commented Jan 4, 2022

if we just scale 3 pods up from 0 it cannot handle the situation

@zetaab your logs say there were two pods not heartbeating. No reason to connect that symptom with scaling.

"instance cortex-ingester-7444b9cb6d-lczgp past heartbeat timeout"

@ChrisSteinbach

We also have this problem where the ingester set doesn't come back up because of this catch-22 where the first ingester doesn't report readiness since it can't contact the others in the set, and the others are not scheduled to start until the first instance reports readiness.

I think this is the same issue reported here, and the issue is closed. Is that because there is a fix, or is the manual removal of instances from the ring deemed sufficient or something else? Just trying to get a grip on whether it needs to be pushed further.

@sharathfeb12

Seeing the same issue on v2.5.0. Any help/pointers to get around this? It's biting us hard in prod :(

@alongfield

Having exactly the same issue with 2.6.1. A single ingester got OOMKilled and brought the entire ingestion pipeline down. I couldn't even get a size=1 ingester ring to come up... it just flaps between PENDING, ACTIVE, and LEAVING, the same as a much larger one. To say the least, it is not building confidence in the ability to keep this thing stable enough to use in prod... there aren't even any docs for what to do about this.

@rmn-lux

rmn-lux commented Oct 3, 2022

maybe this will help

ingester:
  autoforget_unhealthy: true
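
(Note: autoforget_unhealthy is an option in Loki's ingester config; as mentioned earlier in the thread, Cortex has auto-forget for some other components but not for the ingester ring.)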

@paul-bormans

Seeing the same issue on v1.13.1. To be honest, I'm not sure when/how exactly the ingester pods "died", but let's assume OOMKilled. The thing is, the standard recovery pattern by K8S doesn't work since the new ingesters don't report ready because the old instances are still in the ring.

@AkakievKD

@rmn-lux this helped, thanks

@pranav-e6x

facing the same issue, is it resolved?

1 similar comment
@vfzh

vfzh commented Feb 22, 2024

facing the same issue, is it resolved?

@illthizam-healthhelper

still facing this issue

level=warn ts=2024-06-23T15:14:04.57261201Z caller=lifecycler.go:291 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 192.168.116.137:9095 past heartbeat timeout"

@CharlieTLe
Member

CharlieTLe commented Jun 23, 2024

still facing this issue

level=warn ts=2024-06-23T15:14:04.57261201Z caller=lifecycler.go:291 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 192.168.116.137:9095 past heartbeat timeout"

There's a flag that might make this better for you:

# When enabled the readiness probe succeeds only after all instances are
# ACTIVE and healthy in the ring, otherwise only the instance itself is
# checked. This option should be disabled if in your cluster multiple
# instances can be rolled out simultaneously, otherwise rolling updates may be
# slowed down.
# CLI flag: -ingester.readiness-check-ring-health
[readiness_check_ring_health: <boolean> | default = true]

You can try setting it to false.

Also, disabling the heartbeat on the ingester may be useful, because the distributor has its own health check to the ingester, so it's kind of redundant.
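
For reference, a minimal YAML sketch of that first suggestion, assuming the option sits under the ingester's lifecycler block (the flag name -ingester.readiness-check-ring-health suggests so, but double-check the placement against your version's configuration reference):

ingester:
  lifecycler:
    # Only check this instance's own state in /ready instead of requiring
    # every ring member to be ACTIVE and healthy.
    readiness_check_ring_health: false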
