
too many failed ingesters #2131


Closed
rndmh3ro opened this issue May 26, 2020 · 16 comments
Labels
stale A stale issue or PR that will automatically be closed.

Comments

@rndmh3ro

Bug

When instances unexpectedly leave a Consul ring and rejoin, the instances won't work anymore until the ring is deleted.

To Reproduce
Steps to reproduce the behavior:

  1. Start two Loki 1.5.0 containers in monolith mode. They use the same config (see below) and use Consul as the ring store. They use the new boltdb-shipper store and the filesystem object store; however, the problem existed in Loki 1.4.0, too.
    • docker run -p 3100:3100 -v /tmp/loki/:/tmp/loki/ -v /root/loki/config.yml:/config.yaml grafana/loki:1.5.0 -config.file=/config.yaml
    • docker run -p 3101:3100 -v /tmp/loki/:/tmp/loki/ -v /root/loki/config.yml:/config.yaml grafana/loki:1.5.0 -config.file=/config.yaml
  2. Kill one container with docker kill.
  3. A new container automatically comes up (through Nomad) and joins the cluster.
  4. After joining the cluster and querying either one of the instances, both fail with:
level=error ts=2020-05-26T10:43:53.883196602Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
  5. To fix the issue, I have to delete the ring in Consul (see the sketch below). Then both instances auto-join again and everything works.
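
Roughly how I clear the ring (just a sketch; I don't configure a custom prefix, so this assumes Loki's default collectors/ ring prefix in Consul, and -http-addr simply points the CLI at the same Consul host as in the config):

# list what Loki wrote under the ring prefix (assumed default: collectors/)
consul kv get -recurse -http-addr=nomad-servers.service.consul:8500 collectors/

# delete the ring so both instances re-register from scratch
consul kv delete -recurse -http-addr=nomad-servers.service.consul:8500 collectors/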

Same issue as described here: #1159 (comment) and here #660

Expected behavior
The instances continue to work and do not fail.

Environment:

  • Infrastructure: containers on bare-metal
  • Deployment tool: nomad

Config

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  max_transfer_retries: 0 # Disable blocks transfers on ingesters shutdown or rollout.
  chunk_idle_period: 2h # Let chunks sit idle for at least 2h before flushing, this helps to reduce total chunks in store
  max_chunk_age: 2h  # Let chunks get at least 2h old before flushing due to age, this helps to reduce total chunks in store
  chunk_target_size: 1048576 # Target chunks of 1MB, this helps to reduce total chunks in store
  chunk_retain_period: 30s

  lifecycler:
    join_after: 5s
    ring:
      kvstore:
        store: consul
        consul:
          host: "nomad-servers.service.consul:8500"
      replication_factor: 1

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 1680h

storage_config:
  boltdb_shipper:
    shared_store: filesystem
    active_index_directory: /tmp/loki/index
    cache_location: /tmp/loki/boltdb-cache

  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 1680h
  # Per-user ingestion rate limit in sample size per second. Units in MB.
  ingestion_rate_mb: 8
  # Per-user allowed ingestion burst size (in sample size). Units in MB.
  # The burst size refers to the per-distributor local rate limiter even in the
  # case of the "global" strategy, and should be set at least to the maximum logs
  # size expected in a single push request.
  ingestion_burst_size_mb: 16
chunk_store_config:
  max_look_back_period: 0s  # No limit how far we can look back in the store

table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  index_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  retention_deletes_enabled: true
  retention_period:  1680h
@cyriltovena
Contributor

Can you try with docker stop instead? I'm wondering if the container has time to clean up correctly.
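
For example (just a sketch; the 60 second grace period is an arbitrary value), giving the container more time before Docker sends SIGKILL:

# stop gracefully, allowing up to 60s for Loki to shut down and leave the ring
docker stop -t 60 <loki-container-id>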

@rndmh3ro
Author

Yes, it happens with docker stop, too. I would also have thought that there's a problem with the cleanup. Anything more I can do to help debug it?

@rndmh3ro
Author

Here's the log from the new container after stopping one with docker stop:

level=info ts=2020-05-27T07:47:08.826449289Z caller=server.go:179 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2020-05-27T07:47:08.827406937Z caller=shipper.go:122 msg="starting boltdb shipper in 0 mode"
level=info ts=2020-05-27T07:47:08.8302528Z caller=worker.go:58 msg="no address specified, not starting worker"
level=info ts=2020-05-27T07:47:08.831199216Z caller=main.go:79 msg="Starting Loki" version="(version=1.5.0, branch=HEAD, revision=12c7eab8b)"
level=info ts=2020-05-27T07:47:08.831480629Z caller=module_service.go:58 msg=initialising module=runtime-config
level=info ts=2020-05-27T07:47:08.831513321Z caller=module_service.go:58 msg=initialising module=store
level=info ts=2020-05-27T07:47:08.831536088Z caller=module_service.go:58 msg=initialising module=table-manager
level=info ts=2020-05-27T07:47:08.83152124Z caller=manager.go:109 msg="runtime config disabled: file not specified"
level=info ts=2020-05-27T07:47:08.831649576Z caller=module_service.go:58 msg=initialising module=memberlist-kv
level=info ts=2020-05-27T07:47:08.831684028Z caller=module_service.go:58 msg=initialising module=ingester
level=info ts=2020-05-27T07:47:08.831724213Z caller=module_service.go:58 msg=initialising module=ring
level=info ts=2020-05-27T07:47:08.831889143Z caller=module_service.go:58 msg=initialising module=distributor
level=info ts=2020-05-27T07:47:08.832180803Z caller=table_manager.go:292 msg="synching tables" expected_tables=2
level=info ts=2020-05-27T07:47:08.832179475Z caller=lifecycler.go:485 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2020-05-27T07:47:08.832379803Z caller=table_manager.go:444 msg="creating table" table=index_261
level=info ts=2020-05-27T07:47:08.832435013Z caller=table_manager.go:499 msg="provisioned throughput on table, skipping" table=index_262 read=0 write=0
level=info ts=2020-05-27T07:47:08.832414672Z caller=loki.go:228 msg="Loki started"
level=info ts=2020-05-27T07:47:08.837329777Z caller=lifecycler.go:509 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2020-05-27T07:47:10.074336531Z caller=shipper.go:287 msg="downloading all files for period index_262"
level=info ts=2020-05-27T07:47:13.842780507Z caller=lifecycler.go:361 msg="auto-joining cluster after timeout" ring=ingester
level=info ts=2020-05-27T07:47:13.874205494Z caller=metrics.go:81 org_id=fake traceID=6acd3b4fb732e536 latency=fast query="{host=\"localhost\"}" query_type=limited range_type=range length=1h0m0s step=14s duration=10.398268ms status=200 throughput_mb=6.683516 total_bytes_mb=0.069497
level=info ts=2020-05-27T07:47:15.047103315Z caller=metrics.go:81 org_id=fake traceID=5616350efb9f3bbc latency=fast query="{host=\"localhost\"}" query_type=limited range_type=range length=1h0m0s step=14s duration=9.711527ms status=200 throughput_mb=7.156135 total_bytes_mb=0.069497
level=info ts=2020-05-27T07:47:40.237777854Z caller=metrics.go:81 org_id=fake traceID=32c613c1fc47b92b latency=fast query="{host=\"localhost\"}" query_type=limited range_type=range length=1h0m0s step=14s duration=7.782306ms status=200 throughput_mb=8.930129 total_bytes_mb=0.069497
level=info ts=2020-05-27T07:47:44.595807132Z caller=metrics.go:81 org_id=fake traceID=39823fa57adb4bca latency=fast query="{host=\"localhost\"}" query_type=limited range_type=range length=1h0m0s step=14s duration=9.92125ms status=200 throughput_mb=7.004863 total_bytes_mb=0.069497
level=error ts=2020-05-27T07:47:53.831523292Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=error ts=2020-05-27T07:47:53.832518327Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=info ts=2020-05-27T07:48:02.489284978Z caller=metrics.go:81 org_id=fake traceID=2e9efea88830a06 latency=fast query="{host=\"localhost\"}" query_type=limited range_type=range length=1h0m0s step=14s duration=1.57288ms status=500 throughput_mb=0 total_bytes_mb=0
level=warn ts=2020-05-27T07:48:02.489431485Z caller=logging.go:49 traceID=2e9efea88830a06 msg="GET /loki/api/v1/query_range?direction=BACKWARD&end=1590565682486525691&limit=30&query=%7Bhost%3D%22localhost%22%7D&start=1590562082486525691 (500) 1.838364ms Response: \"too many failed ingesters\\n\" ws: false; User-Agent: Go-http-client/1.1; "
level=error ts=2020-05-27T07:48:08.830482495Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=error ts=2020-05-27T07:48:08.83247069Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"

@cyriltovena
Contributor

@pstibrany do you know what's going on?

@pstibrany
Member

Can you please include a screenshot of the /ring page when this situation happens?

@rndmh3ro
Author

Before:
[screenshot]

After:
[screenshot, 2020-05-28 11:14:38]

Interestingly, querying the new instance works for some time but then suddenly stops with the "too many failed ingesters" message. Probably because it starts failing when the instances try to insert data?

@pstibrany
Member

I'm sorry for the misunderstanding. I meant the /ring page on the Loki distributor. It shows the ring status in decoded form.
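
For example (a sketch, using the host ports from your docker run commands above):

# ring status as seen by the first / second instance
curl http://localhost:3100/ring
curl http://localhost:3101/ring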

@rndmh3ro
Author

Before:
[screenshot]

After:
[screenshots]

After deleting the ring:
[screenshot]

@dginther

Also having this problem.

@pstibrany
Member

pstibrany commented Jun 10, 2020

This is how it's designed to work. Killing an instance without giving it a chance to clean up (docker kill), or not giving it enough time to clean up (e.g. by using spot/preemptible instances), can leave a bad entry behind in the ring.

Now, with replication factor one, I wouldn't expect failures on writes. There was recently a bug fixed in Cortex (cortexproject/cortex#2503) – a similar [but perhaps not quite the same] problem – and I wonder if that helps here as well; it's already on Loki master.

On reads, I think the error actually makes sense with RF 1, as the querier assumes that the unhealthy ingester is the only ingester with data. (That's how I understand it.)

@pstibrany
Member

leave bad entry in the ring behind.

This affects the startup of new ingesters too, btw. New ingesters will not become healthy if they find any unhealthy entry in the ring. The idea is to prevent a rollout if something fails, and trigger an alarm instead.

@rndmh3ro
Author

Okay, I investigated further. This time I have a replication factor of 2. I tested it with both 1.5.0 and master, same result:

  1. Start two Loki instances with the same config:
    [screenshot]
  2. Both instances can be successfully queried for data.
  3. Kill one instance; it becomes unhealthy in the ring:
    [screenshot]
  4. The running instance can still be queried.
  5. No new writes are possible, since only one replica is there (expected):
level=warn ts=2020-06-11T07:07:17.536227668Z caller=logging.go:49 traceID=64156b170b51db2 msg="POST /loki/api/v1/push (500) 177.325µs Response: \"at least 2 live replicas required, could only find 1\\n\" ws: false; Content-Length: 464; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
  6. Bring up a new instance:
    [screenshot]
  7. Querying both instances works.
  8. Ingesting does not work (expected, as per your comment: too many failed ingesters #2131 (comment)):
level=warn ts=2020-06-11T07:12:31.039242372Z caller=logging.go:49 traceID=4085b83018dba3be msg="POST /loki/api/v1/push (500) 180.18µs Response: \"at least 2 live replicas required, could only find 1\\n\" ws: false; Content-Length: 410; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
  9. Forget the unhealthy instance in the ring.
  10. Both instances can be queried and ingest data.

Is this the expected behaviour?

Do we then have to monitor for the above error messages and manually forget the unhealthy instances? I wonder if automatically forgetting unhealthy instances, when enough healthy instances are available, would be an option.
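
For reference, this is roughly what I mean by manually forgetting (only a sketch; the forget form field is my guess from looking at the /ring page's form, I haven't verified the parameter name):

# ask the distributor's ring page to drop the unhealthy entry
curl -X POST -d 'forget=<unhealthy-ingester-id>' http://localhost:3100/ring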

@stale

stale bot commented Jul 11, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Jul 11, 2020
@stale stale bot closed this as completed Jul 18, 2020
@iamparamvs

I am facing a similar issue in Loki (currently using 1.6.1)

level=info ts=2020-09-16T14:17:59.53488108Z caller=loki.go:210 msg="Loki started"
level=info ts=2020-09-16T14:18:00.116019272Z caller=memberlist_client.go:460 msg="joined memberlist cluster" reached_nodes=1
level=warn ts=2020-09-16T14:18:03.58007538Z caller=logging.go:62 traceID=1e54d622565652d3 msg="POST /loki/api/v1/push (500) 1.024434ms Response: "empty ring\n" ws: false; Content-Length: 2231; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=warn ts=2020-09-16T14:18:03.638932001Z caller=logging.go:62 traceID=4a7fc243566882a2 msg="POST /loki/api/v1/push (500) 11.736089ms Response: "empty ring\n" ws: false; Content-Length: 7424; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=warn ts=2020-09-16T14:18:03.901275574Z caller=logging.go:62 traceID=4f159ae8e3f67e3d msg="POST /loki/api/v1/push (500) 1.212358ms Response: "empty ring\n" ws: false; Content-Length: 1660; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=warn ts=2020-09-16T14:18:04.244459805Z caller=logging.go:62 traceID=101d6593fa0cd3f7 msg="POST /loki/api/v1/push (500) 622.127µs Response: "empty ring\n" ws: false; Content-Length: 2865; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "

level=error ts=2020-09-16T14:18:14.524604208Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"
level=error ts=2020-09-16T14:18:14.636547958Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"

@glebsa8

glebsa8 commented Sep 29, 2020

Clearing previously registered ingesters from Consul fixed the issue for me, but I don't know how safe this solution is in production. I executed:

consul kv delete -recurse loki-collectors

Here loki-collectors is the prefix from the Loki config:

ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
        prefix: loki-collectors/
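
Before deleting, it can be worth listing what is actually stored under that prefix (a quick sketch):

# inspect the ring entries before removing them
consul kv get -recurse loki-collectors/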

@josephmilla

Saw this same issue with an etcd setup. Had to do this in etcd to make it work:

$ ETCDCTL_API=3 etcdctl del "" --from-key=true
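
Note that del "" --from-key=true deletes every key in etcd, not just Loki's. A narrower sketch, assuming the ring lives under the default collectors/ prefix (adjust if you configured a different one):

# delete only the ring keys
ETCDCTL_API=3 etcdctl del --prefix "collectors/"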
