Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds #62036
Comments
I have noted that the issue does not seem to occur if I create the above deployment without the:
- timeoutSeconds and initialDelaySeconds
- initialDelaySeconds and periodSeconds
I am also experiencing the same issue, and have been able to successfully reproduce it with @RochesterinNYC's minimal deployment manifest.
/sig node
/assign
Seems likely that this is caused by the jitter/sleep that kubelet does before sending the first probe: kubernetes/pkg/kubelet/prober/worker.go, line 118 at b8ab289.
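To make the effect of that jitter concrete, here is a small standalone sketch (not kubelet code; the handling of ticks that land inside the initial delay is an assumption based on the worker logic referenced above) using the probe settings from this issue:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	// Settings from the reproduction in this issue.
	initialDelay := 10 * time.Second
	period := 300 * time.Second

	// Model: the prober worker sleeps a random fraction of the period before
	// its first tick, then ticks every period; ticks that land before
	// initialDelaySeconds has elapsed do not result in a probe.
	jitter := time.Duration(rand.Float64() * float64(period))
	firstEffectiveProbe := jitter
	for firstEffectiveProbe < initialDelay {
		firstEffectiveProbe += period
	}

	fmt.Printf("first effective readiness probe after roughly %v\n",
		firstEffectiveProbe.Round(time.Second))
}
```

Under this model the first real probe can land anywhere up to about periodSeconds after the worker starts (plus another period when the random tick happens to fall inside the initial delay), which is consistent with the 1 to 5 minute delays reported in this issue.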
Noting also that the impetus for this issue is the described/intended behavior of readiness/liveness probes as documented at https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/.

If there is something else that should be happening, or the behavior observed in this issue is actually "expected" when these options are configured together, then that should potentially be called out in the documentation.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
I don't think this should have been left to go stale, because this issue is real and goes against how the probes are described in the docs. We have a probe that, due to API limits, can only run every 600s, but that should not mean the first probe takes up to 10 minutes to happen after a pod first starts. I understand the logic of not wanting all probes to fire at once if the kubelet crashes, but in the case of a single pod creation it should not be a problem.
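To make the scale of that concrete, a probe along these lines (a sketch of the scenario described, not the commenter's actual manifest; the command and initial delay are placeholders) can leave a freshly started pod unready for close to ten minutes, because the prober's first tick is delayed by a random fraction of periodSeconds:

```yaml
# Sketch of the scenario described above: an expensive, rate-limited check
# that may only run every 600 seconds. Command and initial delay are placeholders.
readinessProbe:
  exec:
    command: ["/bin/check-rate-limited-api"]   # hypothetical expensive check
  initialDelaySeconds: 10
  periodSeconds: 600    # first tick can land anywhere in the first ~600s
```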
/reopen

I think this should be re-opened, as I was about to create a new issue with essentially the same problem. I was expecting readiness probes (and presumably liveness probes as well) to execute the first probe immediately once initialDelaySeconds expires. Instead, it may take a substantial amount of time after the initial wait expires before the pod is marked "Ready". The suggestion about the jitter calculation above looks promising, as my results from experimenting with shorter/longer values for periodSeconds were very confusing; the more I tested, the more confused I got.

All of the examples on the Configure Liveness and Readiness Probes page use an initial delay shorter than periodSeconds, and generally short values across the board. The text blocks there leave it semi-ambiguous as to whether the probe is executed immediately at that 5-second mark or has some built-in delay (regardless of whether that delay is based on periodSeconds).
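For reference, the docs example being discussed is of roughly this shape (reproduced from memory of that page, so treat the exact values as illustrative; surrounding Pod fields are omitted):

```yaml
# Docs-style example: initialDelaySeconds no longer than periodSeconds,
# and both values short.
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
```

Nothing in the surrounding text says whether the first check runs at exactly the 5-second mark or somewhere inside the following period, which is the ambiguity being raised here.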
@ankrause: You can't reopen an issue/PR unless you authored it or you are a collaborator.
I am also running into this issue.
/reopen. |
It seems to me that something among the following would be needed:
/reopen

So: why, despite our careful initContainer delays and careful readiness probe delay settings, do we see issues with application cluster formation? The relevant code is kubernetes/pkg/kubelet/prober/worker.go, line 113 at b8ab289:

```go
// run periodically probes the container.
func (w *worker) run() {
	probeTickerPeriod := time.Duration(w.spec.PeriodSeconds) * time.Second
	// If kubelet restarted the probes could be started in rapid succession.
	time.Sleep(time.Duration(rand.Float64() * float64(probeTickerPeriod)))
	probeTicker := time.NewTicker(probeTickerPeriod)
	// ...
}
```

FYI: rand.Float64 returns, as a float64, a pseudo-random number in the range [0.0, 1.0). You can see it's not just the initial delay; it's the initial delay plus a random portion of the probe period (30 sec in our case). Now that we understand this, we can set:
Apart from actually making the thing work more reliably, these changes don't change the time taken for the RabbitMQ cluster to become available to clients much, if at all. The cluster is 'up' once the first instance is running and marked ready (and the application is actually ready). So the first instance will be ready 35 to 65 seconds after pod deployment. The defaults of the (Bitnami) chart are actually sometimes worse, as you often hit a situation where the first readiness probe fails and you then have to wait another 30 seconds for the next readiness probe (10 + 30 + 30).

Please could I suggest that Kubernetes addresses this in the documentation, and possibly modifies the sleep function to constrain the variability to a smaller, hardcoded amount independent of the readiness period (a random portion of 5 sec should be enough for anyone, right...?), or exposes the variability value so that, for time-sensitive readiness gates, a human can set it to zero?

Clarification about Services, DNS records and readiness probes: because A or AAAA records are not created for Pod names, hostname is required for the Pod's A or AAAA record to be created. A Pod with no hostname but with subdomain will only create the A or AAAA record for the headless Service (default-subdomain.my-namespace.svc.cluster-domain.example), pointing to the Pod's IP address. Also, a Pod needs to become ready in order to have a record, unless publishNotReadyAddresses=True is set on the Service.
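For illustration only, here is a sketch of the kind of change being suggested; maxProbeJitter is a made-up constant, not an existing kubelet flag or field, and this is not a proposed patch:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// maxProbeJitter is a hypothetical cap on the pre-probe sleep, used here only
// to illustrate the suggestion above.
const maxProbeJitter = 5 * time.Second

// initialJitter caps the random pre-probe sleep at maxProbeJitter instead of
// letting it grow with the full probe period.
func initialJitter(period time.Duration) time.Duration {
	window := period
	if window > maxProbeJitter {
		window = maxProbeJitter
	}
	return time.Duration(rand.Float64() * float64(window))
}

func main() {
	// With a 30s readiness period, the pre-probe sleep is now at most ~5s.
	fmt.Println(initialJitter(30 * time.Second))
}
```

Capping the window this way would keep some spreading of probes after a kubelet restart while making the time to the first probe of a newly created pod predictable.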
@JamesTGrant: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Same problem here.
Same problem. It's not possible to get a fast startup together with a slower interval between health checks. What's the reason for this restriction, and why is the value of initialDelaySeconds effectively ignored when it's lower than the value of periodSeconds? Is there any workaround?
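One mitigation that follows from the jitter analysis earlier in the thread (a sketch, not an official recommendation, and only viable when the check itself is cheap enough to run often) is to keep periodSeconds small so the random pre-probe sleep stays bounded:

```yaml
# Sketch: with a small periodSeconds the random pre-probe sleep is at most
# periodSeconds, so the first effective probe lands within roughly
# initialDelaySeconds + periodSeconds of container start (10-20s here).
readinessProbe:
  exec:
    command: ["true"]        # placeholder; substitute the real check
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```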
The issue is already 6 years old and hasn't been addressed at all. |
May be related to kubernetes/website#48519 |
/kind bug
What happened:
A deployment with a readiness probe configured with a command that is guaranteed to instantly succeed with exit code 0 (the command `true`), an `initialDelaySeconds` of 10, and a `periodSeconds` of 300 has its pod go `READY` after the pod has already been marked as `Running` for `STATUS` for at least a minute.

What you expected to happen:
The deployment with a readiness probe configured with the settings delineated above has the readiness probe execute roughly 10 seconds after the pod goes into `Running` for `STATUS`. The pod becomes `1/1` for `READY` right after this. I would expect that it does not take between 1 and 5 minutes for the pod to become `READY` after the container/pod has already been marked as `Running` for `STATUS`.

How to reproduce it (as minimally and precisely as possible):
I was able to reproduce it like so with the following minimal `deployment.yaml`:
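The manifest itself is not preserved in this copy of the issue; a minimal sketch consistent with the probe settings described in this issue (image, names, and labels are assumptions) would look like:

```yaml
# Sketch only: reconstructed from the settings described in this issue;
# the image, names, and labels are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        readinessProbe:
          exec:
            command: ["true"]     # guaranteed to exit 0 instantly
          initialDelaySeconds: 10
          periodSeconds: 300
          timeoutSeconds: 30
```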
I created a fresh cluster on GKE (version is the latest available for GKE: `1.9.4-gke.1`) in the `us-central1-a` zone with one n1-standard-2 node.

I targeted the cluster (via gcloud/kubectl under the hood):
```
$ gcloud container clusters get-credentials readiness-probe-test-cluster-latest --zone us-central1-a
Fetching cluster endpoint and auth data.
kubeconfig entry generated for readiness-probe-test-cluster-latest.
```
I set up watching for pods (in two separate terminal windows):
I deployed the deployment via:
```
$ kubectl create -f deployment.yaml
deployment "nginx" created
```
I observe that the pod goes into `Running` `STATUS` very quickly:

The image is pulled and the container started very, very quickly (under 15 seconds at slowest):
But the actual pod does not go `1/1` for `READY` until after at least a minute. Various times have been observed, ranging from about 1 minute to 5 minutes, but it very rarely goes `1/1` in under a minute.

Anything else we need to know?:
I also `time`d a `rollout status` command afterwards (`time`ing `kubectl rollout status`):

The readiness probe command is one that is guaranteed to instantly succeed (`true`) for reproduction purposes. I gave it an egregiously large `timeoutSeconds` value of 30 as well for this purpose. The `periodSeconds` time of 300 is to emulate our real-life use case with a readiness probe that is more costly resources/rate-limiting wise.

Environment:
- Kubernetes version (use `kubectl version`):
- Kernel (e.g. `uname -a`): `Linux nginx-6c7794546b-wb7dl 4.4.111+ #1 SMP Thu Feb 1 22:06:37 PST 2018 x86_64 GNU/Linux`