[test flakes] master-scalability suites #60589
Comments
I'll try getting to them sometime this week (free cycles are scarce atm). |
@shyamjvs is there any update for this issue? |
I took a brief look into that, and either some test(s) are extremely slow or something is hanging somewhere. Part of the logs from the last run:
No test finished within 8h30m |
This indeed seems like a regression. I think the regression happened somewhere between runs: We can try looking into kubemark-5000 to see if it is visible there too. |
Regarding the correctness tests - gce-large-correctness is also failing. |
Thanks a lot for looking @wojtek-t. Wrt the performance job - I too feel strongly there's a regression (though I couldn't get to look into it properly).
I was looking into this a while ago. And there were 2 suspicious changes I found:
|
cc @kubernetes/sig-storage-bugs |
/assign Some of the local storage tests will try to use every node in the cluster, assuming that cluster sizes are not that big. I'll add a fix to cap the max number of nodes. |
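For illustration, a minimal sketch of the node-cap idea described above; the function name, constant, and the exact limit are assumptions, not the actual e2e fix:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// maxTestNodes is an illustrative cap, not the value the real fix uses.
const maxTestNodes = 10

// capNodes trims the node list so a local-storage test does not fan out
// to every node in a very large cluster.
func capNodes(nodes []v1.Node) []v1.Node {
	if len(nodes) > maxTestNodes {
		return nodes[:maxTestNodes]
	}
	return nodes
}
```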
Thanks @msau42 - that would be great. |
Going back to the https://k8s-testgrid.appspot.com/sig-release-master-blocking#gce-scale-performance suite, I took a closer look at runs up to 105 and at runs 108 and later.
[the name of it is misleading - will explain below] Up to run 105, it generally looked like this:
Starting with run 108, it looks more like:
That basically means a ~0.85s increase, and this is roughly what we observe in the end result. Now - what is that "watch lag"?
Since we don't really observe a difference in the "schedule -> start" time of a pod, that suggests it most probably isn't the apiserver (because processing requests and watch is on that path too), and most probably not a slow kubelet either (because it starts the pod). So I think the most likely hypothesis is:
The test didn't change at all around that time. So I think it's most probably the first one. That said, I went through the PRs merged between runs 105 and 108 and didn't find anything useful so far. |
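To make the "watch lag" above concrete, here is a hedged sketch of how one could measure it: the gap between a pod's status flipping and the moment the test's watch handler observes that update. Using the PodReady condition's LastTransitionTime is an assumption for illustration; the real e2e measurement may use different timestamps.

```go
package sketch

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

// watchLag approximates the delay between the moment a pod's status reports
// it as ready and the moment the test's watch handler observes that update.
func watchLag(pod *v1.Pod, observedAt time.Time) time.Duration {
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodReady && c.Status == v1.ConditionTrue {
			return observedAt.Sub(c.LastTransitionTime.Time)
		}
	}
	return 0
}
```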
I think the next step is to:
|
So I looked into example pods. And I'm already seeing this:
So it seems pretty clear that the problem is related to "429"s. |
Are those throttled API calls due to a quota on the owner account? |
This isn't client-side throttling as I thought initially. These are 429s from the apiserver (the reason may be either a slower apiserver for some reason, or more requests coming to the apiserver). |
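A small sketch of the distinction being made: a real apiserver 429 surfaces to the client as a TooManyRequests API error, whereas client-side rate limiting merely delays the call locally. The retry/backoff policy here is illustrative, not what the tests do.

```go
package sketch

import (
	"log"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// retryOn429 retries an API call a few times with exponential backoff when
// the apiserver answers with HTTP 429 (Too Many Requests).
func retryOn429(op func() error) error {
	backoff := 500 * time.Millisecond
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		err = op()
		if err == nil || !apierrors.IsTooManyRequests(err) {
			return err
		}
		log.Printf("apiserver returned 429, backing off %v", backoff)
		time.Sleep(backoff)
		backoff *= 2
	}
	return err
}
```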
Oh, ok. That's not great news. |
@shyamjvs How is etcd configured? We've increased the default value
I still think letting kubelet set the |
Also, revived the really old issue for adding tests for steady-state pod update rate #14391 |
@yujuhong - are you talking about this one: #61504 (or do I misunderstand it)? @wasylkowski @shyamjvs - can you please run 5000-node tests with that PR patched locally (before we merge it) to ensure that this really helps? |
I ran the test against 1.10 HEAD + #61504, and the pod-startup latency seems to be fine:
Will re-run once more to confirm. |
@shyamjvs - thanks a lot! |
…-phases-as-metrics Automatic merge from submit-queue (batch tested with PRs 61378, 60915, 61499, 61507, 61478). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md). Capture pod startup phases as metrics. Learning from kubernetes#60589, we should also start collecting and graphing sub-parts of pod-startup latency. /sig scalability /kind feature /priority important-soon /cc @wojtek-t ```release-note NONE ```
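As a rough illustration of what "capturing pod startup phases as metrics" could look like, here is a hedged sketch using a labeled Prometheus histogram. The metric name, label values, and buckets are assumptions, not the ones the PR actually introduces.

```go
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// podStartupPhaseSeconds records sub-parts of pod-startup latency
// (e.g. "create_to_schedule", "schedule_to_run", "run_to_watch") as
// separate series of one histogram, keyed by a "phase" label.
var podStartupPhaseSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "pod_startup_phase_seconds",
		Help:    "Latency of individual pod startup phases.",
		Buckets: prometheus.ExponentialBuckets(0.01, 2, 15),
	},
	[]string{"phase"},
)

func init() {
	prometheus.MustRegister(podStartupPhaseSeconds)
}

// observePhase records one measured phase duration.
func observePhase(phase string, d time.Duration) {
	podStartupPhaseSeconds.WithLabelValues(phase).Observe(d.Seconds())
}
```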
Second run also seems good:
Fairly confident now the fix did the trick. Let's get it into 1.10 asap. |
Thanks @shyamjvs. As we talked offline, I think we had one more regression in the last month or so, but that one shouldn't block the release. |
Yep. The current fix in that PR is not among the options proposed initially in #60589 (comment) |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md). Fix `PodScheduled` bug for static pod. Fixes #60589. This is an implementation of option 2 in #60589 (comment). I've validated this in my own cluster, and there are no longer continuous status updates for static pods. Signed-off-by: Lantao Liu <lantaol@google.com> ```release-note none ```
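To illustrate the idea behind that "option 2", here is a hedged sketch: only write the `PodScheduled` condition when it is not already set to True, so static pods stop generating an endless stream of status updates. The function is an illustration, not the actual kubelet change.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// needsPodScheduledUpdate reports whether the PodScheduled condition still
// has to be written; if it is already True, skipping the write avoids
// continuous status updates for static pods.
func needsPodScheduledUpdate(pod *v1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodScheduled && c.Status == v1.ConditionTrue {
			return false // already scheduled; no further update needed
		}
	}
	return true
}
```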
re-opening until we have a good performance test result. |
@yujuhong @krzyzacy @shyamjvs @wojtek-t @Random-Liu @wasylkowski-a any updates on this? This is still blocking 1.10 at the moment. |
So the only part of this bug that was blocking the release is the 5k-node performance job. Unfortunately, we lost our run from today due to a different reason (ref: #61190 (comment)) That said, we're fairly confident the fix works based on my manual runs (results pasted in #60589 (comment)). So IMHO we don't need to block release on it (the next run's going to be on wed). |
+1 |
Sorry, I edited my post above. I meant that we should treat it as "non-blocker". |
Ok, thank you very much. This conclusion represents a tremendous amount of hours you have invested, and I cannot possibly thank you all enough for the work you have done. While we talk in the abstract about "community" and "contributors" you, and the others who have worked this issue represent it in concrete terms. You are the very heart and soul of this project, and I know I speak for everyone involved when I say it is an honor to work alongside such passion, commitment, and professionalism. |
[MILESTONENOTIFIER] Milestone Issue: Up-to-date for process @krzyzacy @msau42 @shyamjvs @wojtek-t Issue Labels
|
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md). Update default etcd server to 3.2 for kubernetes 1.11. Reapply #59836 but with the latest etcd 3.2 patch version (3.2.18, which includes the mvcc fix and the leader election timeout fix) and default `--snapshot-count` to 10k to resolve the performance regression in the previous etcd 3.2 server upgrade attempt (#60589 (comment)). See #60589 (comment) for details on the root cause of the performance regression and the scalability test results of setting `--snapshot-count` to 10k. ```release-note Upgrade the default etcd server version to 3.2.18 ``` @gyuho @shyamjvs @jdumars @timothysc
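For context, a hedged sketch of the knob involved: the flag and value come from the PR description above, while the binary path and remaining flags are placeholders rather than the actual cluster startup scripts.

```go
package sketch

import "os/exec"

// startEtcd launches an etcd 3.2 server with the snapshot count pinned to
// 10k, the setting the PR above applies to avoid the regression seen with
// the higher etcd 3.2 default.
func startEtcd(dataDir string) *exec.Cmd {
	return exec.Command("etcd",
		"--data-dir", dataDir,
		"--snapshot-count=10000",
	)
}
```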
This issue was resolved, with the relevant fixes now in for 1.10. /close |
Automatic merge from submit-queue (batch tested with PRs 60891, 60935). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md). Rollback etcd server version to 3.1.11 due to #60589. Ref kubernetes/kubernetes#60589 (comment). The dependencies were a bit complex (so many things relying on it) + the version was updated to 3.2.16 on top of the original bump. So I had to mostly make manual reverting changes on a case-by-case basis - so likely to have errors :) /cc @wojtek-t @jpbetz ```release-note Downgrade default etcd server version to 3.1.11 due to #60589 ``` (I'm not sure if we should instead remove release-notes of the original PRs) Kubernetes-commit: 56195fd1d329e8fb6c3c6cba59e1bc1eb4a2c998
Failing release-blocking suites:
All three suites have been flaking a lot recently, mind triaging?
/sig scalability
/priority failing-test
/kind bug
/status approved-for-milestone
cc @jdumars @jberkus
/assign @shyamjvs @wojtek-t