too many failed ingesters #2131
Comments
Can you try with docker stop instead? I'm wondering if the container has time to clean up correctly.
Yes, it happens with docker stop as well.
Here's the log from the new container after stopping one with docker stop:
@pstibrany do you know what's going on?
Can you please include a screenshot of the /ring page when this situation happens?
I'm sorry for the misunderstanding. I meant the /ring page on the Loki distributor. It shows the ring status in decoded form.
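A minimal sketch of how that page can be fetched, assuming the distributor exposes its HTTP server on the default port 3100 (adjust the host and port for your deployment):

# Fetch the ring status page from the distributor; it renders an HTML table
# of ring members and their current state.
curl http://localhost:3100/ring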
Also having this problem.
This is how it's designed to work. Killing an instance without giving it a chance to clean up (docker kill), or not giving it enough time to clean up (e.g. by using spot/preemptible instances), can leave a bad entry behind in the ring. Now, with replication factor one I wouldn't expect failures on writes. There was recently a fix in Cortex (cortexproject/cortex#2503) for a similar [but perhaps not quite the same] problem; I wonder if that helps here as well – it's already on Loki master. On reads, I think the error actually makes sense with RF 1, as the querier assumes that the unhealthy ingester is the only ingester with data. (That's how I understand it.)
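For reference, a minimal sketch of where the replication factor lives in the Loki config, following the same ingester/lifecycler/ring block quoted later in this thread (the values shown are illustrative, not a recommendation):

ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
      # Number of ingesters each stream is written to. With 1 there is no
      # redundant copy, so a single bad ring entry can make reads fail.
      replication_factor: 1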
This affects the startup of new ingesters too, btw. New ingesters will not become healthy if they find any unhealthy entry in the ring. The idea is to prevent a rollout if something fails, and to trigger an alarm instead.
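A sketch of how this surfaces in practice, assuming the new ingester serves the standard /ready endpoint on port 3100 (host and port are assumptions for your setup):

# While an unhealthy entry sits in the ring, the readiness probe keeps failing,
# so an orchestrator such as Kubernetes never marks the new ingester as ready.
curl -i http://localhost:3100/ready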
Okay, I investigated further. This time I have a replication factor of 2. I tested it with both 1.5.0 and master, same result:
Is this the expected behaviour? Do we then have to monitor for the above error messages and manually forget the unhealthy instances?
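For illustration only, a sketch of what "manually forget" can look like. The /ring page rendered by the distributor has a per-instance Forget button; the form field name and the instance ID below are assumptions (based on the Cortex ring page Loki uses), so inspect the page's HTML in your version before scripting this:

# Ask the ring page to forget a dead instance. "forget" is the form field behind
# the page's Forget button (assumption); "ingester-1" is a placeholder ID.
curl -X POST -d 'forget=ingester-1' http://localhost:3100/ring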
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
I am facing a similar issue in Loki (currently using 1.6.1):

level=info ts=2020-09-16T14:17:59.53488108Z caller=loki.go:210 msg="Loki started"
level=error ts=2020-09-16T14:18:14.524604208Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"
Clearing the previously registered ingesters from Consul fixed the issue for me, but I don't know how safe this solution is in production. I've executed the cleanup against the ring prefix configured here:

ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
        prefix: loki-collectors/
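A sketch of what such a cleanup can look like with the Consul CLI, assuming the loki-collectors/ prefix from the config above; the exact key layout under the prefix can differ between versions, so list it before deleting anything:

# List the keys Loki has written under the configured prefix.
consul kv get -keys loki-collectors/
# Delete the ring key so ingesters re-register from a clean state.
consul kv delete loki-collectors/ring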
Saw this same issue with an etcd setup. Had to do this in etcd to make it work:
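For anyone on etcd, a sketch of the equivalent cleanup with etcdctl, assuming the default collectors/ prefix; the prefix and key name are assumptions, so check your ring kvstore settings and list the keys first:

# List the ring-related keys stored in etcd.
etcdctl get --prefix --keys-only collectors/
# Remove the ring key so ingesters can re-register cleanly.
etcdctl del collectors/ring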
Bug
When instances unexpectedly leave a Consul ring and rejoin, the instances won't work anymore until the Consul ring is deleted.
To Reproduce
Steps to reproduce the behavior:
docker run -p 3100:3100 -v /tmp/loki/:/tmp/loki/ -v /root/loki/config.yml:/config.yaml grafana/loki:1.5.0 -config.file=/config.yaml
docker run -p 3101:3100 -v /tmp/loki/:/tmp/loki/ -v /root/loki/config.yml:/config.yaml grafana/loki:1.5.0 -config.file=/config.yaml
docker kill
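For contrast with the docker kill step above, a sketch of a shutdown that gives the ingester a chance to leave the ring cleanly; <container> is a placeholder and the 30-second grace period is an arbitrary example value:

# docker kill sends SIGKILL immediately, so the ingester never deregisters.
# docker stop sends SIGTERM first and waits before killing, giving the ingester
# time to flush and remove its entry from the ring.
docker stop --time 30 <container>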
Same issue as described here: #1159 (comment) and here #660
Expected behavior
The instances continue to work and do not fail.
Environment:
Config