Describe the bug
In the Kubernetes world, if I delete one Ingester pod, the new one joins the Cortex ring, but the old one still remains in the ring with status "unhealthy". This causes 500 errors in the Distributor because of the unhealthy ingester.
To Reproduce
Steps to reproduce the behavior:
- Start several Loki Ingesters.
- Once they have joined the Cortex ring, delete one Ingester.
- Let the new Ingester come up and show up in the Cortex ring.
Error info found in the Distributor:
level=error ts=2019-10-16T02:47:56.023893394Z caller=pool.go:170 msg="error removing stale clients" err="too many failed ingesters"
Expected behavior
The Ingester should leave the ring properly.
Environment:
- Infrastructure: Kubernetes
- 1 Distributor + 6 Ingesters, with replication_factor: 1 (see the config sketch below)
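For context, a rough sketch of the relevant ring settings for this setup (the Consul endpoint and exact layout are assumptions, not copied from my deployment):

```yaml
# Hypothetical excerpt of the Loki config for this environment;
# only the ring-related settings matter for this issue.
ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
        consul:
          host: consul:8500   # assumed Consul endpoint
      replication_factor: 1   # as described above
```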
Activity
mizeng commented on Oct 16, 2019
Seems related to the logic below.

So for my setting of replication_factor: 1, maxErrors=0; during the loop it finds one unhealthy ingester, so maxErrors becomes -1. Then the error "too many failed ingesters" appears and pool.removeStaleClients is blocked.

In short, if replication_factor is 1, this "Ingester cannot leave the ring" issue will always happen, and you have to manually click "forget".

mizeng commented on Oct 16, 2019
However, even if I set replication_factor=2, when I delete one ingester the error msg="error removing stale clients" err="too many failed ingesters" disappears, but the Distributor still does not delete the unhealthy ingester from the Cortex ring.

rfratto commented on Oct 16, 2019
Do you have any logs from ingester-7dff74c688-9mgsl that show why it may not have left the ring properly?

Also, what is the terminationGracePeriod set to for your ingesters? We run ours with a termination grace period of 4800 to make sure that the ingesters have enough time to cleanly shut down and leave the ring.
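For illustration, a minimal sketch of where that grace period is set in a Kubernetes spec (the StatefulSet name and image are placeholders, not taken from this issue):

```yaml
# Hypothetical ingester StatefulSet excerpt; the key field is
# terminationGracePeriodSeconds, which must be long enough for the
# ingester to hand off or flush its chunks before the pod is killed.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester                            # placeholder name
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 4800   # value mentioned above
      containers:
        - name: ingester
          image: grafana/loki:latest        # placeholder image
```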
mizeng commented on Oct 17, 2019
The pod ingester-7dff74c688-9mgsl was deleted from the cluster, so I have no logs. However, per my understanding, the Distributor is responsible for removing an ingester from the ring, right?

On the other hand, I think the leaving Ingester has finished its work, set its status to unhealthy, and is waiting for the Distributor to delete it. However, due to the bug-like logic above, it will wait to be deleted forever.
rfratto commented on Oct 17, 2019
If you have Promtail monitoring the Loki components, you can use Loki to read logs from the deleted pod. This is how we tend to diagnose these kinds of issues in production.
Not quite. Ingesters handle their own insertion and deletion from the ring. When an ingester is shut down from k8s, after the shutdown process completes (i.e., either handing off its chunks to a pending ingester or flushing chunks to the store), it will update the ring in Consul and remove its own definition.
Distributors only read the ring to find which subset of ingesters should receive data for a particular stream.
mizeng commented on Oct 17, 2019
Thanks for the info!
I will look through the Ingester code again to triage.
mizeng commented on Oct 17, 2019
I've tried enlarging "terminationGracePeriodSeconds" to give the Ingester more time to finish the transfer during the shutdown process. However, I still get the error below:
Could we make the two lines below configurable (MinBackoff, MaxBackoff)?
https://github.com/grafana/loki/blob/master/pkg/ingester/transfer.go#L142-L143
These values are too short for the Ingester's attempts to transfer its data to another ingester, since a new Ingester usually needs quite a few seconds to start up.
mizeng commented on Oct 17, 2019
Increased MaxTransferRetries to 200, still get the error:
mizeng commented on Oct 17, 2019
So I think the newly started Ingester moves past the "Pending" state too soon, which means the old Ingester cannot find an Ingester in the "Pending" state to transfer to. I'll try increasing "join-after" so the Ingester stays in the "Pending" state longer.
mizeng commented on Oct 17, 2019
Increase "join-after" to 30s, manually test deleting one pod for several times, and Ingester removed from consul normally.
rfratto commented on Oct 17, 2019
Yes, join-after should be set to a period that's high enough for the leaving ingester to discover the joining ingester.

It's interesting that it doesn't exit from Consul when the transfer fails; it's definitely supposed to. I would expect something like this to be in the logs, followed by a flush and removal from Consul:
I suspect there might be a crash somewhere when the ingester falls back to doing a flush on exit, and it might happen too quickly to show up in the Docker logs. What are you using for the chunk and index store?
mizeng commented on Oct 18, 2019
Chunk: Ceph Obj Store with S3 API
Index: ElasticSearch
HuippuJanne commented on Nov 13, 2019
With loki:v0.4.0 it's possible to use the flag "ingester.max-transfer-retries: 0", which somebody in the Loki Slack chat recommended playing around with for this error. DON'T DO THAT; the error still happens, but you can no longer recover even by deleting the ring (e.g. curl -XDELETE localhost:8500/v1/kv/collectors/ring).
To date, the only known way to work around this annoying, constantly recurring defect is to:
stale commented on Dec 13, 2019
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
createdanew commented on Jan 21, 2020
So was there a definitive answer to this? I haven't been able to discover a proper solution. I'm also having issues with "Unhealthy" ingesters not leaving the ring.
sfro commented on Feb 20, 2020
We are having the same issue with version v1.3.0. Currently playing around with some of the parameters mentioned in this issue.
senior88oqz commented on Mar 27, 2020
Anyone have a good way to get around this? I'm experiencing a similar problem when ingester pods are killed unexpectedly (i.e., OOMKilled). I could only bring them back to a healthy state by deleting and redeploying the ring.