Ingester not leaving Ring properly again #1159
Comments
This seems related to the logic below:
So for my settings, in short, if ...
However, even if I set ...
Do you have any logs from the ingester that was shut down? Also, what is the terminationGracePeriod set to for your ingesters? We run ours with a termination grace period of 4800 seconds to make sure that the ingesters have enough time to cleanly shut down and leave the ring.
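(For reference, a minimal sketch of where that grace period is set on the ingester pods; the manifest below is illustrative, not taken from this deployment, and the names and image tag are assumptions.)

```yaml
# Illustrative StatefulSet fragment; names and image tag are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester
spec:
  serviceName: ingester
  replicas: 3
  selector:
    matchLabels:
      app: ingester
  template:
    metadata:
      labels:
        app: ingester
    spec:
      # Give the ingester time to hand off or flush its chunks and
      # deregister from the ring before Kubernetes kills the pod.
      terminationGracePeriodSeconds: 4800
      containers:
        - name: ingester
          image: grafana/loki:v0.4.0
          args:
            - -target=ingester
            - -config.file=/etc/loki/loki.yaml
```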
The pod has already been deleted, so I can't get its logs. On the other hand, I think the departing Ingester has finished its work, set its status to unhealthy, and is waiting for the distributor to delete it. However, due to the current bug-like logic, it will wait for deletion forever.
If you have Promtail monitoring the Loki components, you can use Loki to read logs from the deleted pod. This is how we tend to diagnose these kinds of issues in production.
Not quite. Ingesters handle their own insertion and deletion from the ring. When an ingester is shut down from k8s, after the shutdown process completes (i.e., either handing off its chunks to a pending ingester or flushing chunks to the store), it will update the ring in Consul and remove its own definition. Distributors only read the ring to find which subset of ingesters should receive data for a particular stream.
Thanks for the info! I will look through the Ingester code again to triage.
I've tried increasing "terminationGracePeriodSeconds" to give the Ingester more time to finish the transfer during shutdown, but I still get the error below:
Could we make the two lines below configurable (MinBackoff, MaxBackoff)? They are too short for the Ingester to keep retrying the transfer of its data to another ingester, since a new Ingester usually needs quite a few seconds to start up.
Increasing MaxTransferRetries to 200, I still get the error:
So I think the newly started Ingester goes past the "Pending" state too soon, which means the old Ingester cannot find a "Pending" Ingester to transfer to. I'll try increasing "join-after" so the Ingester stays in the "Pending" state longer.
Increase "join-after" to 30s, manually test deleting one pod for several times, and Ingester removed from consul normally.
Yes. It's interesting that it's not exiting from Consul when the transfer doesn't succeed; it's definitely supposed to. I would expect something like this to be in the logs, followed by a flush and removal from Consul:
I suspect there might be a crash somewhere when the ingester falls back to flushing on exit, and it might happen too quickly to show up in the Docker logs. What are you using for the chunk and index store?
Chunk: Ceph Obj Store with S3 API
With loki:v0.4.0 it's possible to use the flag "ingester.max-transfer-retries: 0", which somebody in the Loki Slack chat recommended playing around with for this error. DON'T DO THAT; the error still happens, and you can't even recover by deleting the ring (e.g. curl -XDELETE localhost:8500/v1/kv/collectors/ring). To this day, the only known way to work around this annoying, constantly recurring defect is to:
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
So was there a definitive answer to this? I haven't been able to discover a proper solution. I'm also having issues with "Unhealthy" ingesters not leaving the ring.
We are having the same issue with version v1.3.0. Currently playing around with some of the parameters mentioned in this issue.
Anyone have a good way to get around this? Experiencing a similar problem when ingester pods are killed unexpectedly (i.e., ...).
Describe the bug
In the Kubernetes world, if I delete one Ingester pod, the new one joins the Cortex ring, but the old one still exists in the ring with status "unhealthy". This causes 500 errors in the Distributor for the unhealthy ingester.
To Reproduce
Steps to reproduce the behavior:
Error info found in Distributor
Expected behavior
Ingester should leave Ring properly.
Environment:
replication_factor: 1
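For context, a sketch of where that setting sits in the Loki ring configuration. The exact YAML layout is my assumption; the collectors/ prefix matches the collectors/ring key deleted in the workaround mentioned above.

```yaml
# Assumed minimal ring configuration for this environment.
ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
        prefix: collectors/   # the ring lives at collectors/ring in Consul
      replication_factor: 1   # with a single replica, one unhealthy ingester
                              # is enough to make the distributor return 500s
```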
Screenshots for Cortex Ring

The text was updated successfully, but these errors were encountered: