etcdserver: read-only range request took too long with etcd 3.2.24 #70082
Comments
/kind bug
/sig api-machinery
/area etcd
cc @jingyih
Are you using SSD or HDD? (I think t2 and t3 instances can come with either SSD or HDD.) Do you see any 'wal: sync duration of' warning messages with etcd v3.2.18?
@jingyih You might be right; it turns out my launch template is using HDD (standard, not gp2). I'll retry with etcd 3.2.24 and report here. To answer your question, I have no such error with etcd 3.2.18.
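For anyone else checking this on AWS, the volume type a node actually received can be confirmed with the AWS CLI; a minimal sketch, using a hypothetical instance ID and assuming configured credentials:

```sh
# List the type (gp2 vs. standard) and size of the EBS volumes attached to an instance.
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'Volumes[].{Id:VolumeId,Type:VolumeType,Size:Size}' \
  --output table
```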
Kubernetes version: 1.12.1
Probably the disk is too slow, see https://github.com/etcd-io/etcd/blob/master/Documentation/metrics.md#disk
Etcd metrics in my cluster:
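The relevant disk-latency histograms can be pulled straight from the etcd metrics endpoint; a minimal sketch, assuming a plain-HTTP client endpoint on 127.0.0.1:2379 (add --cacert/--cert/--key if client TLS is enabled):

```sh
# Show the wal fsync and backend commit duration histograms.
curl -s http://127.0.0.1:2379/metrics \
  | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'
```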
Hi, I tested with gp2 SSD on AWS and I have the same issue, though I don't have the same wal fsync duration error. A colleague of mine has the same issue with Rancher and the same etcd version, also on SSD.
I'll try with an EBS-optimized instance and a dedicated disk to rule out disk latency. The cluster seems to function normally even with etcd 3.2.24.
Please check the etcd disk backend commit duration metrics.
@jpbetz here is what I have:
I'm not sure how to read this.
I have the same issue with my cluster; I didn't notice it before I saw this issue opened.
Errors:
@ArchiFleKs It says that of 354842 total operations, 228127 took less than or equal to .002 seconds and 348658 took less than or equal to .004 seconds, where the 348658 number includes those that took less than .002 seconds as well. A very small portion (85, to be exact) of disk backend commits took over 128ms. I'm not well enough calibrated to say for sure whether those numbers are out of the healthy range, but they don't look particularly alarming. The wal fsync duration metric would also be worth checking.
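The same kind of read-out can be computed directly from the raw metrics; a rough sketch, again assuming a plain-HTTP client endpoint on 127.0.0.1:2379:

```sh
# Fraction of backend commits slower than 4ms, derived from the cumulative histogram.
curl -s http://127.0.0.1:2379/metrics | awk '
  $1 ~ /backend_commit_duration_seconds_bucket/ && $1 ~ /le="0.004"/ { fast = $2 }
  $1 ~ /backend_commit_duration_seconds_count/ { total = $2 }
  END { printf "slower than 4ms: %d of %d (%.2f%%)\n", total - fast, total, 100 * (total - fast) / total }'
```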
Yes, I agree for the original report. Just to clarify, this is an empty cluster with nothing on it, except a fresh cluster with kubeadm 1.12 and CoreDNS. What I find very weird is that I have no issue with the same config and etcd 3.2.18. I'll check the wal fsync metric.
Hi,
@ArchiFleKs Looks like there was one fsync that took between 2.048 and 4.096 seconds and two that took between 0.512 and 1.024 seconds. This would result in messages like the one you saw ("wal: sync duration of 2.067539108s, expected less than 1s"). https://github.com/etcd-io/etcd/blob/e8b940f268a80c14f7082589f60cbfd3de531d12/wal/wal.go#L572 both tallies this metric and logs the message. If you're seeing the log message at higher rates than the metric suggests, that might require further investigation, but the log message does appear to be correctly telling us that there was excessively high disk latency.
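If it helps to compare how often the log message appears against what the metric reports, a quick sketch for a kubeadm setup (the static pod name etcd-$(hostname) is an assumption; adjust it to your environment):

```sh
# Count slow-fsync warnings logged by the local etcd member.
kubectl -n kube-system logs "etcd-$(hostname)" | grep -c 'sync duration of'
```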
So, what's the suggested etcd version for 1.12.x? 3.2.24?
@lixianyang Yes, etcd 3.2.24 is recommended for k8s 1.12.x.
Is there any possible workaround? I'm experiencing the same issue with Azure; I also tried to run etcd on a separate SSD disk, which didn't help.
@lixianyang If you're still wondering, I tried 3.2.18 and it works better than 3.2.24; I don't see these 'read-only range request took too long' messages anymore.
FYI, having the same issue.
As an update here, I'm seeing this same error. Have we identified whether this is due to the version of etcd? Currently 3.2.24 is the recommended etcd version for 1.12 and 1.13, with 1.14 updating to 3.3.10 (source).
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle rotten
Same error here. My colleague upgraded our Kubernetes cluster from 1.6.4 to 1.18.19, and this error occurs. Everything was working before the upgrade. I can execute kubectl commands, but controller-manager and scheduler are unhealthy, and I do not know why.
# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                     ERROR
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused
etcd-0               Healthy     {"health":"true"}
What does your disk situation look like? SSDs?
@NorseGaud Not SSD... but I do not think that is the reason; this cluster was created half a year ago and has been running healthily.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale, or close it with /close.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. You can mark this issue as fresh with /remove-lifecycle rotten, or close it with /close.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. You can reopen this issue with /reopen, or mark it as fresh with /remove-lifecycle rotten.
Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue.
Is there a setting that can be changed in order to increase this limit?
Any solutions for this etcd issue?
I'm seeing a lot of these messages too. The backend_commit_duration seems fine to me, and neither IOPS nor CPU usage looks strange.
I am getting these messages also, and am locked out of the web UI. Running on RAID 1 SSDs.
To quote myself where I already quoted myself:
So please state your sustained SSD performance in IOPS. I would advise against using desktop SSDs, because these tend not to be able to deliver sustained IOPS for a long time. HTH
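A common way to measure sustained fsync latency on the etcd disk is the fio test the etcd community usually recommends; a sketch, where the directory and sizes are example values and --directory should point at the filesystem holding your etcd data dir:

```sh
# Simulate etcd's WAL write pattern: sequential writes with an fdatasync after each one.
mkdir -p /var/lib/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-fsync-test
# Check the fdatasync latency percentiles in the output; the commonly cited target
# is a 99th percentile under roughly 10ms.
rm -rf /var/lib/etcd/fio-test
```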
Hey everyone, it seems like the ext4 write barrier can cause poor performance even on a fast SSD. Check this out: https://medium.com/paypal-tech/scaling-kubernetes-to-over-4k-nodes-and-200k-pods-29988fad6ed
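For reference, the barrier is controlled by a mount option; a minimal sketch, assuming the etcd data dir sits on its own ext4 filesystem mounted at /var/lib/etcd, and keeping in mind the data-loss caveat raised later in the thread:

```sh
# Remount the etcd volume without write barriers. Only consider this when the write
# cache is protected against power loss (e.g. a battery-backed RAID controller cache).
mount -o remount,nobarrier /var/lib/etcd
# To persist across reboots, add "nobarrier" to that mount's options in /etc/fstab.
```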
I fixed this issue by changing to a locally deployed VPN server. The ping times between masters changed from ~60ms to ~3ms and the alert is gone.
OK for performance, but it is not true that disabling the write barrier has no effect on data loss! The only condition where it's OK to do fsync writes with the write barrier disabled is if you are using a battery-backed write cache. If your RAID controller has a battery backup unit (BBU) or similar technology to protect the cache contents on power loss, make sure to disable the individual internal caches of the attached disks in the controller settings, as these are not protected by the RAID controller battery. That said, the number of people who consider this relevant may be very small, given the PostgreSQL fsyncgate.
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
When deploying a brand new cluster with either t2.medium (4 GB RAM) or t3.large (8 GB RAM) instances, I get errors in the etcd logs:
What you expected to happen:
I expect the logs to be free of errors.
How to reproduce it (as minimally and precisely as possible):
Launch a kubeadm cluster with Kubernetes version v1.12.1
Anything else we need to know?:
When downgrading to Kubernetes v1.11.3 there are no errors anymore. Also, staying on v1.12.1 and manually downgrading etcd to v3.2.18 (which ships with Kubernetes v1.11.3) works around the issue.
Environment:
- Kubernetes version (use kubectl version): v1.12.1
- Kernel (e.g. uname -a): Linux ip-10-0-3-11.eu-west-1.compute.internal 4.14.67-coreos #1 SMP Mon Sep 10 23:14:26 UTC 2018 x86_64 Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz GenuineIntel GNU/Linux