
too many failed ingesters #2131


Closed
rndmh3ro opened this issue May 26, 2020 · 16 comments
Labels
stale A stale issue or PR that will automatically be closed.

Comments

@rndmh3ro

Bug

When instances unexpectedly leave a Consul ring and rejoin, the instances won't work anymore until the ring is deleted.

To Reproduce
Steps to reproduce the behavior:

  1. Start two Loki 1.5.0 containers in monolith mode. They use the same config (see below) and use Consul as the ring store. They use the new boltdb-shipper store and the filesystem object store; however, the problem existed in Loki 1.4.0, too.
    • docker run -p 3100:3100 -v /tmp/loki/:/tmp/loki/ -v /root/loki/config.yml:/config.yaml grafana/loki:1.5.0 -config.file=/config.yaml
    • docker run -p 3101:3100 -v /tmp/loki/:/tmp/loki/ -v /root/loki/config.yml:/config.yaml grafana/loki:1.5.0 -config.file=/config.yaml
  2. Kill one container with docker kill.
  3. A new container automatically comes up (through Nomad) and joins the cluster.
  4. After joining the cluster and querying either one of the instances, both fail with:
level=error ts=2020-05-26T10:43:53.883196602Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
  5. To fix the issue, I have to delete the ring in Consul (see the sketch below). Then both instances auto-join again and everything works.
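
Roughly how I clear the ring (just a sketch; I don't configure a custom prefix, so this assumes Loki's default collectors/ ring prefix in Consul, and -http-addr simply points the CLI at the same Consul host as in the config):

# list what Loki wrote under the ring prefix (assumed default: collectors/)
consul kv get -recurse -http-addr=nomad-servers.service.consul:8500 collectors/

# delete the ring so both instances re-register from scratch
consul kv delete -recurse -http-addr=nomad-servers.service.consul:8500 collectors/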

Same issue as described here: #1159 (comment) and here #660

Expected behavior
The instances continue to work and do not fail.

Environment:

  • Infrastructure: containers on bare-metal
  • Deployment tool: nomad

Config

auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  max_transfer_retries: 0 # Disable blocks transfers on ingesters shutdown or rollout.
  chunk_idle_period: 2h # Let chunks sit idle for at least 2h before flushing, this helps to reduce total chunks in store
  max_chunk_age: 2h  # Let chunks get at least 2h old before flushing due to age, this helps to reduce total chunks in store
  chunk_target_size: 1048576 # Target chunks of 1MB, this helps to reduce total chunks in store
  chunk_retain_period: 30s

  lifecycler:
    join_after: 5s
    ring:
      kvstore:
        store: consul
        consul:
          host: "nomad-servers.service.consul:8500"
      replication_factor: 1

schema_config:
  configs:
    - from: 2018-04-15
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 1680h

storage_config:
  boltdb_shipper:
    shared_store: filesystem
    active_index_directory: /tmp/loki/index
    cache_location: /tmp/loki/boltdb-cache

  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 1680h
  # Per-user ingestion rate limit in sample size per second. Units in MB.
  ingestion_rate_mb: 8
  # Per-user allowed ingestion burst size (in sample size). Units in MB.
  # The burst size refers to the per-distributor local rate limiter even in the
  # case of the "global" strategy, and should be set at least to the maximum logs
  # size expected in a single push request.
  ingestion_burst_size_mb: 16
chunk_store_config:
  max_look_back_period: 0s  # No limit how far we can look back in the store

table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  index_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  retention_deletes_enabled: true
  retention_period:  1680h
@cyriltovena
Contributor

Can you try with docker stop instead? I'm wondering if the container has time to clean up correctly.
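
For example (just a sketch; the 60 second grace period is an arbitrary value), giving the container more time before Docker sends SIGKILL:

# stop gracefully, allowing up to 60s for Loki to shut down and leave the ring
docker stop -t 60 <loki-container-id>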

@rndmh3ro
Author

Yes, it happens with docker stop, too. I would also have thought that there's a problem with the cleanup. Anything more I can do to help debug it?

@rndmh3ro
Author

Here's the log from the new container after stopping one with docker stop:

level=info ts=2020-05-27T07:47:08.826449289Z caller=server.go:179 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2020-05-27T07:47:08.827406937Z caller=shipper.go:122 msg="starting boltdb shipper in 0 mode"
level=info ts=2020-05-27T07:47:08.8302528Z caller=worker.go:58 msg="no address specified, not starting worker"
level=info ts=2020-05-27T07:47:08.831199216Z caller=main.go:79 msg="Starting Loki" version="(version=1.5.0, branch=HEAD, revision=12c7eab8b)"
level=info ts=2020-05-27T07:47:08.831480629Z caller=module_service.go:58 msg=initialising module=runtime-config
level=info ts=2020-05-27T07:47:08.831513321Z caller=module_service.go:58 msg=initialising module=store
level=info ts=2020-05-27T07:47:08.831536088Z caller=module_service.go:58 msg=initialising module=table-manager
level=info ts=2020-05-27T07:47:08.83152124Z caller=manager.go:109 msg="runtime config disabled: file not specified"
level=info ts=2020-05-27T07:47:08.831649576Z caller=module_service.go:58 msg=initialising module=memberlist-kv
level=info ts=2020-05-27T07:47:08.831684028Z caller=module_service.go:58 msg=initialising module=ingester
level=info ts=2020-05-27T07:47:08.831724213Z caller=module_service.go:58 msg=initialising module=ring
level=info ts=2020-05-27T07:47:08.831889143Z caller=module_service.go:58 msg=initialising module=distributor
level=info ts=2020-05-27T07:47:08.832180803Z caller=table_manager.go:292 msg="synching tables" expected_tables=2
level=info ts=2020-05-27T07:47:08.832179475Z caller=lifecycler.go:485 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2020-05-27T07:47:08.832379803Z caller=table_manager.go:444 msg="creating table" table=index_261
level=info ts=2020-05-27T07:47:08.832435013Z caller=table_manager.go:499 msg="provisioned throughput on table, skipping" table=index_262 read=0 write=0
level=info ts=2020-05-27T07:47:08.832414672Z caller=loki.go:228 msg="Loki started"
level=info ts=2020-05-27T07:47:08.837329777Z caller=lifecycler.go:509 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2020-05-27T07:47:10.074336531Z caller=shipper.go:287 msg="downloading all files for period index_262"
level=info ts=2020-05-27T07:47:13.842780507Z caller=lifecycler.go:361 msg="auto-joining cluster after timeout" ring=ingester
level=info ts=2020-05-27T07:47:13.874205494Z caller=metrics.go:81 org_id=fake traceID=6acd3b4fb732e536 latency=fast query="{host=\"localhost\"}" query_type=limited range_type=range length=1h0m0s step=14s duration=10.398268ms status=200 throughput_mb=6.683516 total_bytes_mb=0.069497
level=info ts=2020-05-27T07:47:15.047103315Z caller=metrics.go:81 org_id=fake traceID=5616350efb9f3bbc latency=fast query="{host=\"localhost\"}" query_type=limited range_type=range length=1h0m0s step=14s duration=9.711527ms status=200 throughput_mb=7.156135 total_bytes_mb=0.069497
level=info ts=2020-05-27T07:47:40.237777854Z caller=metrics.go:81 org_id=fake traceID=32c613c1fc47b92b latency=fast query="{host=\"localhost\"}" query_type=limited range_type=range length=1h0m0s step=14s duration=7.782306ms status=200 throughput_mb=8.930129 total_bytes_mb=0.069497
level=info ts=2020-05-27T07:47:44.595807132Z caller=metrics.go:81 org_id=fake traceID=39823fa57adb4bca latency=fast query="{host=\"localhost\"}" query_type=limited range_type=range length=1h0m0s step=14s duration=9.92125ms status=200 throughput_mb=7.004863 total_bytes_mb=0.069497
level=error ts=2020-05-27T07:47:53.831523292Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=error ts=2020-05-27T07:47:53.832518327Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=info ts=2020-05-27T07:48:02.489284978Z caller=metrics.go:81 org_id=fake traceID=2e9efea88830a06 latency=fast query="{host=\"localhost\"}" query_type=limited range_type=range length=1h0m0s step=14s duration=1.57288ms status=500 throughput_mb=0 total_bytes_mb=0
level=warn ts=2020-05-27T07:48:02.489431485Z caller=logging.go:49 traceID=2e9efea88830a06 msg="GET /loki/api/v1/query_range?direction=BACKWARD&end=1590565682486525691&limit=30&query=%7Bhost%3D%22localhost%22%7D&start=1590562082486525691 (500) 1.838364ms Response: \"too many failed ingesters\\n\" ws: false; User-Agent: Go-http-client/1.1; "
level=error ts=2020-05-27T07:48:08.830482495Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"
level=error ts=2020-05-27T07:48:08.83247069Z caller=pool.go:161 msg="error removing stale clients" err="too many failed ingesters"

@cyriltovena
Contributor

@pstibrany do you know what's going on?

@pstibrany
Member

Can you please include a screenshot of the /ring page when this situation happens?

@rndmh3ro
Author

Before:
[screenshot]

After:
[screenshot, 2020-05-28 11:14:38]

Interestingly, querying the new instance works for some time but then suddenly stops with the "too many failed ingesters" message. Probably because it starts failing when the instances try to insert data?

@pstibrany
Member

I'm sorry for the misunderstanding. I meant the /ring page on the Loki distributor. It shows the ring status in decoded form.
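
For example (a sketch, using the host ports from your docker run commands above):

# ring status as seen by the first / second instance
curl http://localhost:3100/ring
curl http://localhost:3101/ring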

@rndmh3ro
Author

Before:
[screenshot]

After:
[screenshots]

After deleting the ring:
[screenshot]

@dginther

Also having this problem.

@pstibrany
Member

pstibrany commented Jun 10, 2020

This is how it's designed to work. Killing an instance without giving it a chance to clean up (docker kill), or not giving it enough time to clean up (e.g. by using spot/preemptible instances), can leave a bad entry behind in the ring.

Now, with replication factor one, I wouldn't expect failures on writes. There was recently a bug fixed in Cortex (cortexproject/cortex#2503) – a similar [but perhaps not quite the same] problem – and I wonder if that helps here as well; it's already on Loki master.

On reads, I think the error actually makes sense with RF 1, as the querier assumes that the unhealthy ingester is the only ingester with data. (That's how I understand it.)

@pstibrany
Member

leave bad entry in the ring behind.

This affects the startup of new ingesters too, btw. New ingesters will not become healthy if they find any unhealthy entry in the ring. The idea is to prevent a rollout if something fails, and trigger an alarm instead.

@rndmh3ro
Author

Okay, I investigated further. This time I have a replication factor of 2. I tested it with both 1.5.0 and master, same result:

  1. Start two Loki instances with the same config:
    [screenshot]
  2. Both instances can be successfully queried for data.
  3. Kill one instance; it becomes unhealthy in the ring:
    [screenshot]
  4. The running instance can still be queried.
  5. No new writes are possible, since only one replica is there (expected):
level=warn ts=2020-06-11T07:07:17.536227668Z caller=logging.go:49 traceID=64156b170b51db2 msg="POST /loki/api/v1/push (500) 177.325µs Response: \"at least 2 live replicas required, could only find 1\\n\" ws: false; Content-Length: 464; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
  6. Bring up a new instance:
    [screenshot]
  7. Querying both instances works.
  8. Ingesting does not work (expected, as per your comment: too many failed ingesters #2131 (comment)):
level=warn ts=2020-06-11T07:12:31.039242372Z caller=logging.go:49 traceID=4085b83018dba3be msg="POST /loki/api/v1/push (500) 180.18µs Response: \"at least 2 live replicas required, could only find 1\\n\" ws: false; Content-Length: 410; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
  9. Forget the unhealthy instance in the ring.
  10. Both instances can be queried and ingest data.

Is this the expected behaviour?

Do we then have to monitor for the above error messages and manually forget the unhealthy instances? I wonder if automatically forgetting unhealthy instances, when enough healthy instances are available, would be an option.
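
For reference, this is roughly what I mean by manually forgetting (only a sketch; the forget form field is my guess from looking at the /ring page's form, I haven't verified the parameter name):

# ask the distributor's ring page to drop the unhealthy entry
curl -X POST -d 'forget=<unhealthy-ingester-id>' http://localhost:3100/ring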

@stale

stale bot commented Jul 11, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Jul 11, 2020
@stale stale bot closed this as completed Jul 18, 2020
@iamparamvs

I am facing a similar issue in Loki (currently using 1.6.1)

level=info ts=2020-09-16T14:17:59.53488108Z caller=loki.go:210 msg="Loki started"
level=info ts=2020-09-16T14:18:00.116019272Z caller=memberlist_client.go:460 msg="joined memberlist cluster" reached_nodes=1
level=warn ts=2020-09-16T14:18:03.58007538Z caller=logging.go:62 traceID=1e54d622565652d3 msg="POST /loki/api/v1/push (500) 1.024434ms Response: "empty ring\n" ws: false; Content-Length: 2231; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=warn ts=2020-09-16T14:18:03.638932001Z caller=logging.go:62 traceID=4a7fc243566882a2 msg="POST /loki/api/v1/push (500) 11.736089ms Response: "empty ring\n" ws: false; Content-Length: 7424; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=warn ts=2020-09-16T14:18:03.901275574Z caller=logging.go:62 traceID=4f159ae8e3f67e3d msg="POST /loki/api/v1/push (500) 1.212358ms Response: "empty ring\n" ws: false; Content-Length: 1660; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "
level=warn ts=2020-09-16T14:18:04.244459805Z caller=logging.go:62 traceID=101d6593fa0cd3f7 msg="POST /loki/api/v1/push (500) 622.127µs Response: "empty ring\n" ws: false; Content-Length: 2865; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; "

level=error ts=2020-09-16T14:18:14.524604208Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"
level=error ts=2020-09-16T14:18:14.636547958Z caller=pool.go:161 msg="error removing stale clients" err="empty ring"

@glebsa8

glebsa8 commented Sep 29, 2020

Clearing previously registered ingesters from Consul fixed the issue for me, but I don't know how safe this solution is in production. I executed:

consul kv delete -recurse loki-collectors

Here loki-collectors is the prefix from the Loki config:

ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
        prefix: loki-collectors/
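
Before deleting, it can be worth listing what is actually stored under that prefix (a quick sketch):

# inspect the ring entries before removing them
consul kv get -recurse loki-collectors/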

@josephmilla

Saw this same issue with an etcd setup. Had to do this in etcd to make it work:

$ ETCDCTL_API=3 etcdctl del "" --from-key=true
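
Note that del "" --from-key=true deletes every key in etcd, not just Loki's. A narrower sketch, assuming the ring lives under the default collectors/ prefix (adjust if you configured a different one):

# delete only the ring keys
ETCDCTL_API=3 etcdctl del --prefix "collectors/"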
