New ingesters not ready if there's a faulty ingester in the ring #3040

Description

pracucci (Contributor)

The ingester readiness endpoint fails on ingester startup if there's an unhealthy ingester within the ring. This seems to create some confusion for users (eg. #2913), and I'm also not sure this logic makes sense when running the Cortex chunks storage with WAL or the Cortex blocks storage.
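
For context on how this bites during rollouts: the /ready endpoint is typically wired into the pod's readiness probe, so a failing check holds back a Kubernetes rolling update. A minimal, illustrative probe; the port and timings below are assumptions, not taken from this issue:

    readinessProbe:
      httpGet:
        path: /ready
        port: 80          # Cortex server HTTP port; adjust to your deployment
      initialDelaySeconds: 15
      periodSeconds: 10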

I'm opening this issue to have a discussion about it. In particular:

  1. Why was this check introduced?
  2. What would happen if we remove it?

Activity

pstibrany (Contributor) commented on Aug 14, 2020

My understanding is that it is used to stop a rollout in case of any problems in the ring, and to make the operator investigate and fix (or just "forget") the bad entry.

@tomwilkie will surely know the reasoning here.

bboreham (Contributor) commented on Aug 14, 2020

I also believe it is to halt a rolling update.

Related: #1521

slim-bean (Contributor) commented on Aug 14, 2020

I recently added #2936 to help users diagnose this problem. It doesn't address the question of changing the behavior, but it should make it easier for operators to diagnose and recover.

pracucci (Contributor, Author) commented on Aug 19, 2020

> I recently added #2936 to help users diagnose this problem. It doesn't address the question of changing the behavior, but it should make it easier for operators to diagnose and recover.

Yes, it's a starting point. However, I think the main use case for people is having Cortex self-heal. If you run ingesters as Kubernetes Deployments (so each new ingester gets a random ID) and you lose just one ingester, its pod will be rescheduled by Kubernetes but will not join the ring, because the previous one (which didn't shut down cleanly) is still unhealthy within the ring.

bboreham (Contributor) commented on Aug 19, 2020

> will not join the ring

I run ingesters in a Deployment and I don't think I've observed that.

andrewl3wis commented on Aug 19, 2020

> will not join the ring

> I run ingesters in a Deployment and I don't think I've observed that.

They will join the ring, but the health check for pod readiness sees that the ring is unhealthy, which leads the pod that joined to replace the failed one to restart as well.

bboreham (Contributor) commented on Aug 20, 2020

Not seen that either.
Can you show log files where a pod decides to restart because it is not ready?

bboreham (Contributor) commented on Sep 10, 2020

Today I see a message was added as part of the gossiping changes:

level=warn ts=2020-09-09T22:14:14.839462614Z caller=lifecycler.go:230 msg="found an existing instance(s) with a problem in the ring, this instance cannot complete joining and become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance ingester-5c45496986-k74wz past heartbeat timeout"

which is lying because the ingester printing this message is ACTIVE in the ring (although it is not ready).

pstibrany (Contributor) commented on Sep 10, 2020

> Today I see a message was added as part of the gossiping changes:

It was actually added in #2936, because this state is confusing for people.

It is perhaps not very precise, but what it's trying to say is that the ingester cannot report "ready" via the /ready handler, because there are other ingesters in the ring that are not ACTIVE.

bboreham (Contributor) commented on Sep 10, 2020

To clarify my complaint:
"this instance cannot become ready until this problem is resolved" would be ok.
"this instance cannot complete joining" is incorrect: the ingester has joined the ring and is in ACTIVE state.

slim-bean (Contributor) commented on Sep 10, 2020

@bboreham I created #3158 to correct the message.

stale commented on Nov 9, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

[22 hidden items]

slim-bean (Contributor) commented on Jan 3, 2022

You can use the ring http endpoint to "forget" unhealthy entries and not have to scale to zero and back up.

Both Cortex and Loki now have a configuration parameter to "auto forget" unhealthy ring entries too.

bboreham (Contributor) commented on Jan 4, 2022

> Both Cortex and Loki now have a configuration parameter to "auto forget" unhealthy ring entries too.

I believe Cortex has this for store-gateway, ruler and alertmanager. Not ingester.

bboreham (Contributor) commented on Jan 4, 2022

> if we just scale 3 pods up from 0 it cannot handle the situation

@zetaab your logs say there were two pods not heartbeating. No reason to connect that symptom with scaling.

"instance cortex-ingester-7444b9cb6d-lczgp past heartbeat timeout"

ChrisSteinbach commented on Jan 17, 2022

We also have this problem: the ingester set doesn't come back up because of a catch-22 where the first ingester doesn't report readiness since it can't contact the others in the set, and the others are not scheduled to start until the first instance reports readiness.

I think this is the same issue reported here, and the issue is closed. Is that because there is a fix, or is manual removal of instances from the ring deemed sufficient, or something else? Just trying to get a grip on whether this needs to be pushed further.
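
The "others are not scheduled to start until the first instance reports readiness" behaviour matches the default OrderedReady pod management of a StatefulSet. Assuming the ingesters run as a StatefulSet, one common way to avoid that particular scheduling deadlock (it does not remove the stale ring entries themselves) is parallel pod management; an illustrative fragment, with names assumed and the pod template elided:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: ingester                   # name assumed for illustration
    spec:
      podManagementPolicy: Parallel    # start pods in parallel instead of one-by-one
      serviceName: ingester
      replicas: 3
      selector:
        matchLabels:
          app: ingester
      template:
        # ... pod template with the ingester container and its /ready probe ...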

sharathfeb12 commented on Jun 2, 2022

Seeing the same issue on v2.5.0. Any help/pointers to get around this? It's biting us hard in prod :(

alongfield commented on Jul 26, 2022

Having exactly the same issue with 2.6.1. A single ingester got OOMKilled and brought the entire ingestion pipeline down. I couldn't even get a size=1 ingester ring to come up... it just flaps between PENDING, ACTIVE, and LEAVING, the same as a much larger one. To say the least, it is not building confidence in the ability to keep this thing stable enough to use in prod... there aren't even any docs for what to do about this.

rmn-lux commented on Oct 3, 2022

Maybe this will help:

ingester:
  autoforget_unhealthy: true

paul-bormans commented on Oct 17, 2022

Seeing the same issue on v1.13.1. To be honest, I'm not sure when/how exactly the ingester pods "died", but let's assume OOMKilled. The thing is, the standard recovery pattern by Kubernetes doesn't work, since the new ingesters don't report ready while the old instances are still in the ring.

AkakievKD commented on Sep 30, 2023

@rmn-lux this helped thanks

pranav-e6x commented on Dec 30, 2023

facing the same issue, is it resolved?

vfzh commented on Feb 22, 2024

facing the same issue, is it resolved?

illthizam-healthhelper commented on Jun 23, 2024

still facing this issue

level=warn ts=2024-06-23T15:14:04.57261201Z caller=lifecycler.go:291 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 192.168.116.137:9095 past heartbeat timeout"

CharlieTLe (Member) commented on Jun 23, 2024

> still facing this issue
>
> level=warn ts=2024-06-23T15:14:04.57261201Z caller=lifecycler.go:291 component=ingester msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 192.168.116.137:9095 past heartbeat timeout"

There's a flag that might make this better for you:

  # When enabled the readiness probe succeeds only after all instances are
  # ACTIVE and healthy in the ring, otherwise only the instance itself is
  # checked. This option should be disabled if in your cluster multiple
  # instances can be rolled out simultaneously, otherwise rolling updates may be
  # slowed down.
  # CLI flag: -ingester.readiness-check-ring-health
  [readiness_check_ring_health: <boolean> | default = true]

You can try setting it to false.

Also, disabling the heartbeat on the ingester may be useful, because the distributor has its own health check to the ingester, so it's kind of redundant.
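
For reference, a minimal sketch of what those two suggestions might look like in YAML, assuming the keys sit under the ingester lifecycler/ring blocks as in Cortex's config and that a zero duration disables the heartbeat and the heartbeat timeout; exact nesting and semantics vary by project and version, so check your config reference:

    ingester:
      lifecycler:
        # Let the ingester report ready without checking the rest of the ring
        # (CLI flag: -ingester.readiness-check-ring-health)
        readiness_check_ring_health: false
        # Assumption: 0 disables the periodic ring heartbeat
        heartbeat_period: 0s
        ring:
          # Assumption: 0 means instances are never marked unhealthy for
          # missing heartbeats
          heartbeat_timeout: 0s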

yongzhang commented on Jul 9, 2025

I have the same issue when enabling the pattern ingester. Does the same solution apply as for the ingester?
