Ingester not leaving Ring properly again #1159


Closed
mizeng opened this issue Oct 16, 2019 · 17 comments
Labels
stale A stale issue or PR that will automatically be closed.

Comments

@mizeng
Contributor

mizeng commented Oct 16, 2019

Describe the bug
In the Kubernetes world, if I delete one Ingester pod, the new one joins the Cortex ring, but the old one still remains in the ring with status "unhealthy". This causes 500 errors in the Distributor because of the unhealthy ingester.

To Reproduce
Steps to reproduce the behavior:

  1. Start several Loki Ingesters.
  2. Once they have joined the Cortex ring, delete one Ingester.
  3. Let the new Ingester come up and show up in the Cortex ring.

Error info found in the Distributor:

level=error ts=2019-10-16T02:47:56.023893394Z caller=pool.go:170 msg="error removing stale clients" err="too many failed ingesters"

Expected behavior
The Ingester should leave the ring properly.

Environment:

  • Infrastructure: Kubernetes
  • 1 Distributor + 6 Ingesters, replication_factor: 1

Screenshots for Cortex Ring
Screen Shot 2019-10-16 at 10 27 38 AM

@mizeng
Contributor Author

mizeng commented Oct 16, 2019

This seems related to the logic below:

maxErrors := r.cfg.ReplicationFactor / 2

for _, ingester := range r.ringDesc.Ingesters {
	if !r.IsHealthy(&ingester, Read) {
		maxErrors--
		continue
	}
	ingesters = append(ingesters, ingester)
}

if maxErrors < 0 {
	return ReplicationSet{}, fmt.Errorf("too many failed ingesters")
}

So with my setting replication_factor: 1, maxErrors starts at 0; during the loop it finds one unhealthy ingester, so maxErrors becomes -1. The too many failed ingesters error then appears and blocks pool.removeStaleClients.

In short, if replication_factor is 1, this "Ingester cannot leave the ring" issue will always occur, and I have to manually click "forget".
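
To make the integer-division point concrete, here is a tiny illustrative Go snippet (not the actual Cortex code) showing how many unhealthy ingesters each replication factor tolerates:

package main

import "fmt"

// Illustrative only: mirrors the maxErrors arithmetic quoted above.
// With integer division, replication_factor 1 tolerates zero unhealthy
// ingesters, so a single "unhealthy" ring entry already trips the error.
func main() {
	for rf := 1; rf <= 3; rf++ {
		maxErrors := rf / 2
		fmt.Printf("replication_factor=%d tolerates %d unhealthy ingester(s)\n", rf, maxErrors)
	}
}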

@mizeng
Contributor Author

mizeng commented Oct 16, 2019

However, even if I set replication_factor=2 and delete one ingester, the msg="error removing stale clients" err="too many failed ingesters" error disappears, but the distributor still does not delete the unhealthy ingester from the Cortex ring.

@rfratto
Member

rfratto commented Oct 16, 2019

Do you have any logs from ingester-7dff74c688-9mgsl that show why it may not have left the ring properly?

Also, what is the terminationGracePeriod set to for your ingesters? We run ours with a termination grace period of 4800 to make sure that the ingesters have enough time to cleanly shut down and leave the ring.
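
For reference, a minimal sketch of where that setting lives in the ingester's Kubernetes pod spec (an illustrative fragment only, not a complete manifest; the image and args are placeholders):

# Illustrative fragment: a long termination grace period gives the ingester
# time to transfer or flush its chunks and to remove itself from the ring.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 4800
      containers:
        - name: ingester
          image: grafana/loki:v0.4.0
          args: ["-target=ingester", "-config.file=/etc/loki/loki.yaml"]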

@mizeng
Contributor Author

mizeng commented Oct 17, 2019

The pod ingester-7dff74c688-9mgsl is deleted from the cluster, so I have no logs. However, per my understanding, the Distributor is responsible for removing an ingester from the ring, right?

On the other hand, I think the departing Ingester has finished its work, set its status to unhealthy, and is waiting for the distributor to delete it. However, due to the bug-like logic above, that deletion will hang forever.

@rfratto
Member

rfratto commented Oct 17, 2019

The pod ingester-7dff74c688-9mgsl is deleted from the cluster, so I have no logs.

If you have Promtail monitoring the Loki components, you can use Loki to read logs from the deleted pod. This is how we tend to diagnose these kinds of issues in production.

However, per my understanding, the Distributor is responsible for removing an ingester from the ring, right?

Not quite. Ingesters handle their own insertion and deletion from the ring. When an ingester is shut down from k8s, after the shutdown process completes (i.e., either handing off its chunks to a pending ingester or flushing chunks to the store), it will update the ring in Consul and remove its own definition.

Distributors only read the ring to find which subset of ingesters should receive data for a particular stream.
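
Roughly, the shutdown flow looks like the following pseudocode (illustrative only; these function names are not the real Cortex/Loki lifecycler API):

package main

import (
	"context"
	"errors"
	"log"
)

// Illustrative paraphrase of the shutdown flow described above; none of
// these functions exist under these names in the real code.
func transferChunksOut(ctx context.Context) error {
	// The real code retries looking for a PENDING ingester; here we just fail.
	return errors.New("cannot find ingester to transfer chunks to: no pending ingesters")
}

func flushChunksToStore(ctx context.Context) { log.Println("flushing chunks to the store") }

func removeFromRing(ctx context.Context) { log.Println("removing ingester entry from Consul") }

func shutdownIngester(ctx context.Context) {
	log.Println("changing ingester state to LEAVING") // stop accepting new pushes
	if err := transferChunksOut(ctx); err != nil {
		// No PENDING ingester available: fall back to flushing chunks to the store.
		flushChunksToStore(ctx)
	}
	// Either way, the ingester is supposed to remove its own entry from the ring in Consul.
	removeFromRing(ctx)
}

func main() { shutdownIngester(context.Background()) }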

@mizeng
Contributor Author

mizeng commented Oct 17, 2019

Thanks for the info!

I will look through the Ingester code again to triage.

@mizeng
Contributor Author

mizeng commented Oct 17, 2019

I've tried to increase "terminationGracePeriodSeconds" to give the Ingester more time to finish its transfer work during the shutdown process. However, I still get the error below:

level=info ts=2019-10-17T02:41:36.88751461Z caller=signals.go:54 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
level=info ts=2019-10-17T02:41:36.887868532Z caller=loki.go:175 msg="notifying module about stopping" module=ingester
level=info ts=2019-10-17T02:41:36.887962795Z caller=loki.go:175 msg="notifying module about stopping" module=store
level=info ts=2019-10-17T02:41:36.88798284Z caller=loki.go:175 msg="notifying module about stopping" module=overrides
level=info ts=2019-10-17T02:41:36.887994483Z caller=loki.go:175 msg="notifying module about stopping" module=server
level=info ts=2019-10-17T02:41:36.888007006Z caller=loki.go:157 msg=stopping module=ingester
level=info ts=2019-10-17T02:41:36.888107629Z caller=lifecycler.go:476 msg="changing ingester state from" old_state=ACTIVE new_state=LEAVING
level=warn ts=2019-10-17T02:41:36.919255013Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=36.269µs err="Ingester is shutting down" msg="gRPC\n"
level=warn ts=2019-10-17T02:41:36.920469276Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=23.758µs err="Ingester is shutting down" msg="gRPC\n"
level=error ts=2019-10-17T02:41:36.931286753Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=warn ts=2019-10-17T02:41:36.935355383Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=28.733µs err="Ingester is shutting down" msg="gRPC\n"
level=warn ts=2019-10-17T02:41:36.944533693Z caller=grpc_logging.go:37 duration=31.231µs method=/logproto.Pusher/Push err="Ingester is shutting down" msg="gRPC\n"
level=warn ts=2019-10-17T02:41:37.039120867Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=42.454µs err="Ingester is shutting down" msg="gRPC\n"
level=error ts=2019-10-17T02:41:37.122320289Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T02:41:37.394799396Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T02:41:39.785721826Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T02:41:41.69763919Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T02:41:43.980020725Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T02:41:48.211619621Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T02:41:51.658072712Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T02:41:56.115294403Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T02:42:00.278117785Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T02:42:00.278163917Z caller=lifecycler.go:485 msg="Failed to transfer chunks to another ingester" err="terminated after 10 retries"
rpc error: code = Unknown desc = an error occurred when try to find container "8b9b060e3ed3b387cf3c7fce5ab62ec8a5650b5fdf3777530d2e2519d6422261": does not exist

Could we make the two lines below (MinBackoff, MaxBackoff) configurable?
https://github.com/grafana/loki/blob/master/pkg/ingester/transfer.go#L142-L143

These hard-coded values make the window too short for the Ingester to transfer its data to another ingester, since a new Ingester usually needs quite a few seconds to start up.

func (i *Ingester) TransferOut(ctx context.Context) error {
	backoff := util.NewBackoff(ctx, util.BackoffConfig{
		MinBackoff: 100 * time.Millisecond,
		MaxBackoff: 5 * time.Second,
		MaxRetries: i.cfg.MaxTransferRetries,
	})
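
A hypothetical sketch of what making those values configurable could look like (MinTransferBackoff and MaxTransferBackoff are made-up names, not existing Loki options; only MaxTransferRetries exists today):

package main

import (
	"fmt"
	"time"
)

// Hypothetical config struct: the transfer backoff bounds come from
// configuration instead of being hard-coded in TransferOut.
type transferConfig struct {
	MinTransferBackoff time.Duration
	MaxTransferBackoff time.Duration
	MaxTransferRetries int
}

func main() {
	cfg := transferConfig{
		MinTransferBackoff: 100 * time.Millisecond, // currently hard-coded
		MaxTransferBackoff: 5 * time.Second,        // currently hard-coded
		MaxTransferRetries: 200,
	}
	// TransferOut would then build its util.BackoffConfig from cfg rather than from literals.
	fmt.Printf("transfer backoff: min=%s max=%s retries=%d\n",
		cfg.MinTransferBackoff, cfg.MaxTransferBackoff, cfg.MaxTransferRetries)
}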

@mizeng
Contributor Author

mizeng commented Oct 17, 2019

Increasing MaxTransferRetries to 200, I still get the error:

level=info ts=2019-10-17T02:50:35.552227013Z caller=lifecycler.go:476 msg="changing ingester state from" old_state=ACTIVE new_state=LEAVING
level=warn ts=2019-10-17T02:50:35.556979119Z caller=grpc_logging.go:37 duration=60.331µs method=/logproto.Pusher/Push err="Ingester is shutting down" msg="gRPC\n"
level=warn ts=2019-10-17T02:50:35.557841863Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=21.16µs err="Ingester is shutting down" msg="gRPC\n"
level=error ts=2019-10-17T02:50:35.587792449Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=warn ts=2019-10-17T02:50:35.619469858Z caller=grpc_logging.go:37 duration=31.471µs method=/logproto.Pusher/Push err="Ingester is shutting down" msg="gRPC\n"
level=warn ts=2019-10-17T02:50:35.634755385Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=29.345µs err="Ingester is shutting down" msg="gRPC\n"
level=warn ts=2019-10-17T02:50:35.726500033Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=31.265µs err="Ingester is shutting down" msg="gRPC\n"
level=error ts=2019-10-17T02:50:35.728427343Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=warn ts=2019-10-17T02:50:35.730688923Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=20.325µs err="Ingester is shutting down" msg="gRPC\n"
level=warn ts=2019-10-17T02:50:35.731236672Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=29.376µs err="Ingester is shutting down" msg="gRPC\n"
level=warn ts=2019-10-17T02:50:35.926145016Z caller=grpc_logging.go:37 method=/logproto.Pusher/Push duration=41.565µs err="Ingester is shutting down" msg="gRPC\n"
...
level=error ts=2019-10-17T02:55:34.098410311Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
rpc error: code = Unknown desc = an error occurred when try to find container "c8c81d7cbcc1375e122a4edf09bfd5353dadbcbfe08c3f6ce037947287766043": does not exist

@mizeng
Contributor Author

mizeng commented Oct 17, 2019

So I think the newly started Ingester passes through the "Pending" state too quickly, which is why the old Ingester cannot find a "Pending" ingester to transfer to. I will try increasing "join-after" so the Ingester stays in the "Pending" state longer.
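
For reference, a sketch of where that knob sits in the Loki config file (assuming the standard ingester lifecycler settings; values here are illustrative):

# Illustrative fragment of the Loki config: join_after controls how long a new
# ingester stays in PENDING (and can accept a chunk handoff) before going ACTIVE.
ingester:
  lifecycler:
    join_after: 30s
    ring:
      kvstore:
        store: consul
      replication_factor: 1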

@mizeng
Contributor Author

mizeng commented Oct 17, 2019

Increase "join-after" to 30s, manually test deleting one pod for several times, and Ingester removed from consul normally.

level=error ts=2019-10-17T07:47:45.04502428Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T07:47:49.733248524Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T07:47:54.635613838Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T07:47:58.687537256Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T07:48:03.088022Z caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=info ts=2019-10-17T07:48:07.978962134Z caller=transfer.go:157 msg="sending chunks" to_ingester=10.15.100.94:9095
level=info ts=2019-10-17T07:48:38.068920049Z caller=transfer.go:218 msg="successfully sent chunks" to_ingester=10.15.100.94:9095
level=info ts=2019-10-17T07:49:08.100882187Z caller=lifecycler.go:374 msg="ingester removed from consul"
level=info ts=2019-10-17T07:49:08.100958304Z caller=lifecycler.go:298 msg="member.loop() exited gracefully"
level=info ts=2019-10-17T07:49:08.100995483Z caller=loki.go:157 msg=stopping module=store
level=info ts=2019-10-17T07:49:08.101083783Z caller=loki.go:157 msg=stopping module=overrides
level=info ts=2019-10-17T07:49:08.101121924Z caller=loki.go:157 msg=stopping module=server

@rfratto
Member

rfratto commented Oct 17, 2019

Yes, the join-after should be set to a period that's high enough for the leaving ingester to discover the joining ingester.

It's interesting that it isn't removing itself from Consul when the transfer fails; it's definitely supposed to. I would expect something like this to be in the logs, followed by a flush and removal from Consul:

level=error ts=2019-10-17T02:55:34 caller=transfer.go:144 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
level=error ts=2019-10-17T02:55:34 caller=lifecycler.go:479 msg="Failed to transfer chunks to another ingester" err="cannot find ingester to transfer chunks to: no pending ingesters"

I suspect there might be a crash somewhere when the ingester falls back to doing a flush on exit, and it might happen too quickly to show up in the Docker logs. What are you using for the chunk and index store?

@mizeng
Contributor Author

mizeng commented Oct 18, 2019

Chunk: Ceph Obj Store with S3 API
Index: ElasticSearch

@HuippuJanne

With loki:v0.4.0 it's possible to use the flag "ingester.max-transfer-retries: 0", which somebody in the Loki Slack chat recommended playing around with for this error. DON'T DO THAT; the error still happens, but you cannot even recover by deleting the ring (e.g. curl -XDELETE localhost:8500/v1/kv/collectors/ring).

To date, the only known way to work around this annoying, constantly recurring defect is to:

  • set more retries, e.g. "ingester.max-transfer-retries: 2"
  • set up a background process that polls the ingester logs and, when an ingester gets into this weird state with Consul, issues the ring deletion

@stale

stale bot commented Dec 13, 2019

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label on Dec 13, 2019
stale bot closed this as completed on Dec 20, 2019
@createdanew

So was there a definitive answer to this? I haven't been able to find a proper solution. I'm also having issues with "Unhealthy" ingesters not leaving the ring.

@sfro

sfro commented Feb 20, 2020

We are having the same issue with version v1.3.0. Currently playing around with some of the parameters mentioned in this issue.

@senior88oqz

senior88oqz commented Mar 27, 2020

Does anyone have a good way to get around this? I'm experiencing a similar problem when ingester pods are killed unexpectedly (i.e., OOMKilled). I could only bring them back to a healthy state by deleting and redeploying the ring.
