Description
Bug
When instances unexpectedly leave the Consul ring and rejoin, the instances won't work anymore until the ring in Consul is deleted.
To Reproduce
Steps to reproduce the behavior:
- Started two Loki 1.5.0 containers in monolithic mode. They share the same config (see below) and use Consul as the ring store. They use the new boltdb-shipper store with the filesystem backend; however, the problem existed in Loki 1.4.0, too.
docker run -p 3100:3100 -v /tmp/loki/:/tmp/loki/ -v /root/loki/config.yml:/config.yaml grafana/loki:1.5.0 -config.file=/config.yaml
docker run -p 3101:3100 -v /tmp/loki/:/tmp/loki/ -v /root/loki/config.yml:/config.yaml grafana/loki:1.5.0 -config.file=/config.yaml
- Kill one container with docker kill
- A new container automatically comes up (through Nomad) and joins the cluster
- After joining the cluster and querying either one of the instances, both fail with:
level=error ts=2020-05-26T10:43:53.883196602Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
- To fix the issue, I have to delete the ring in Consul (a sketch of the command is shown below). Then both instances auto-join again and everything works.
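A sketch of deleting the ring via the Consul CLI, assuming the default collectors/ prefix used by the Loki ring and the Consul address from the config below (not necessarily the exact command used here):
# removes every key under the ring prefix; by default Loki keeps the ring at collectors/ring
consul kv delete -recurse -http-addr=http://nomad-servers.service.consul:8500 collectors/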
Same issue as described here: #1159 (comment) and here #660
Expected behavior
The instances continue to work and do not fail.
Environment:
- Infrastructure: containers on bare-metal
- Deployment tool: nomad
Config
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  max_transfer_retries: 0    # Disable blocks transfers on ingesters shutdown or rollout.
  chunk_idle_period: 2h      # Let chunks sit idle for at least 2h before flushing, this helps to reduce total chunks in store
  max_chunk_age: 2h          # Let chunks get at least 2h old before flushing due to age, this helps to reduce total chunks in store
  chunk_target_size: 1048576 # Target chunks of 1MB, this helps to reduce total chunks in store
  chunk_retain_period: 30s
  lifecycler:
    join_after: 5s
    ring:
      kvstore:
        store: consul
        consul:
          host: "nomad-servers.service.consul:8500"
      replication_factor: 1
schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 1680h
storage_config:
  boltdb_shipper:
    shared_store: filesystem
    active_index_directory: /tmp/loki/index
    cache_location: /tmp/loki/boltdb-cache
  filesystem:
    directory: /tmp/loki/chunks
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 1680h
  # Per-user ingestion rate limit in sample size per second. Units in MB.
  ingestion_rate_mb: 8
  # Per-user allowed ingestion burst size (in sample size). Units in MB.
  # The burst size refers to the per-distributor local rate limiter even in the
  # case of the "global" strategy, and should be set at least to the maximum logs
  # size expected in a single push request.
  ingestion_burst_size_mb: 16
chunk_store_config:
  max_look_back_period: 0s # No limit how far we can look back in the store
table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  index_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  retention_deletes_enabled: true
  retention_period: 1680h
cyriltovena commented on May 27, 2020
Can you try with docker stop instead? I'm wondering if the container has time to clean up correctly.
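A sketch of giving the container a longer shutdown grace period before Docker falls back to SIGKILL (the 60-second value and container name are placeholders):
# wait up to 60s for Loki to shut down cleanly before the container is killed
docker stop --time 60 <loki-container>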
rndmh3ro commented on May 27, 2020
Yes, it happens with docker stop, too. I also would have thought that there's a problem with the cleanup. Anything more I can do to help debug it?
rndmh3ro commented on May 27, 2020
Here's the log from the new container after stopping one with docker stop:
cyriltovena commented on May 27, 2020
@pstibrany do you know what's going on?
pstibrany commented on May 27, 2020
Can you please include a screenshot of the /ring page when this situation happens?
rndmh3ro commented on May 28, 2020
Before: (screenshot)
After: (screenshot)
Interestingly, querying the new instance works for some time but then suddenly stops with the "too many failed ingesters" message. Probably because it starts failing when the instances try to insert data?
pstibrany commented on May 28, 2020
I'm sorry for the misunderstanding. I meant the /ring page on the Loki distributor. It shows the ring status in decoded form.
rndmh3ro commented on May 28, 2020
Before: (screenshot of the /ring page)
After: (screenshot of the /ring page)
After deleting the ring: (screenshot of the /ring page)
dginther commented on Jun 10, 2020
Also having this problem.
pstibrany commented on Jun 10, 2020
This is how it's designed to work. Killing an instance without giving it a chance to clean up (docker kill), or not giving it enough time to clean up (e.g. when using spot/preemptible instances), can leave a bad entry behind in the ring.
Now, with a replication factor of 1, I wouldn't expect failures on writes. There was recently a fix in Cortex (cortexproject/cortex#2503) for a similar [but perhaps not quite the same] problem; I wonder if that helps here as well – it's already on Loki master.
On reads – I think the error actually makes sense with RF 1, as the querier assumes that the unhealthy ingester is the only ingester holding the data. (That's how I understand it.)
pstibrany commented on Jun 10, 2020
This affects the startup of new ingesters too, by the way. New ingesters will not become healthy if they find any unhealthy entry in the ring. The idea is to prevent a rollout from proceeding when something fails, and to trigger an alarm instead.
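A sketch of manually dropping such a stale entry, assuming the running Loki build exposes the Forget action on its /ring page (the page accepts a POST with the instance ID to forget); the instance ID and address are placeholders:
# ask the ring page to forget the dead ingester's entry
curl -X POST -d 'forget=<instance-id>' http://localhost:3100/ring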
rndmh3ro commented on Jun 11, 2020
Okay, I investigated further. This time I have a replication factor of 2. I tested it with both 1.5.0 and master, same result:
Is this the expected behaviour?
Do we then have to monitor for the above error message and manually forget the unhealthy instances?
I wonder if automatically forgetting unhealthy instances, when enough healthy instances are available, would be an option.
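As an aside, newer Loki releases (later than the versions discussed in this thread) add an ingester option that does roughly this, automatically forgetting unhealthy ring entries; a sketch, assuming a version that supports it:
ingester:
  autoforget_unhealthy: true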
stale commented on Jul 11, 2020
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
iamparamvs commented on Sep 16, 2020
I am facing a similar issue in Loki (currently using 1.6.1):
level=info ts=2020-09-16T14:17:59.53488108Z caller=loki.go:210 msg="Loki started"
level=info ts=2020-09-16T14:18:00.116019272Z caller=memberlist_client.go:460 msg="joined memberlist cluster" reached_nodes=1
level=warn ts=2020-09-16T14:18:03.58007538Z caller=logging.go:62 traceID=1e54d622565652d3 msg="POST /loki/api/v1/push (500) 1.024434ms Response: "empty ring\n" ws: false; Content-Length: 2231; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=warn ts=2020-09-16T14:18:03.638932001Z caller=logging.go:62 traceID=4a7fc243566882a2 msg="POST /loki/api/v1/push (500) 11.736089ms Response: "empty ring\n" ws: false; Content-Length: 7424; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=warn ts=2020-09-16T14:18:03.901275574Z caller=logging.go:62 traceID=4f159ae8e3f67e3d msg="POST /loki/api/v1/push (500) 1.212358ms Response: "empty ring\n" ws: false; Content-Length: 1660; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=warn ts=2020-09-16T14:18:04.244459805Z caller=logging.go:62 traceID=101d6593fa0cd3f7 msg="POST /loki/api/v1/push (500) 622.127µs Response: "empty ring\n" ws: false; Content-Length: 2865; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=error ts=2020-09-16T14:18:14.524604208Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"
level=error ts=2020-09-16T14:18:14.636547958Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"
glebsa8 commented on Sep 29, 2020
Clearing the previously registered ingesters from Consul fixed the issue for me, but I don't know how safe this solution is in production. I executed the clean-up against the loki-collectors prefix, which comes from my Loki config.
josephmilla commented on Nov 20, 2020
Saw this same issue with an etcd setup. Had to do this in etcd to make it work.
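As an illustration only (not necessarily the command referred to above): with an etcd v3 backend, clearing the ring keys could look like the following, assuming the default collectors/ prefix; adjust to the configured prefix:
# deletes every key under the ring prefix via the etcd v3 API
ETCDCTL_API=3 etcdctl del --prefix collectors/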