Description
/kind bug
What happened:
A deployment with a readiness probe configured with a command that is guaranteed to succeed instantly with exit code 0 (the command `true`), an `initialDelaySeconds` of 10, and a `periodSeconds` of 300 has its pod go `READY` after the pod has already been marked `Running` for `STATUS` for at least a minute.
What you expected to happen:
The deployment with a readiness probe configured as above has the readiness probe execute roughly 10 seconds after the pod goes into `Running` for `STATUS`, and the pod becomes `1/1` for `READY` right after that. I would not expect it to take between 1 and 5 minutes for the pod to become `READY` after the container/pod has already been marked `Running` for `STATUS`.
How to reproduce it (as minimally and precisely as possible):
I was able to reproduce it with the following minimal `deployment.yaml`:
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        readinessProbe:
          exec:
            command:
            - 'true'
          periodSeconds: 300
          initialDelaySeconds: 10
          timeoutSeconds: 30
I created a fresh cluster on GKE (the version is the latest available for GKE: `1.9.4-gke.1`) in the `us-central1-a` zone with one n1-standard-2 node:
$ gcloud container clusters create readiness-probe-test-cluster-latest --num-nodes=1 --machine-type n1-standard-2 --zone us-central1-a --cluster-version=1.9.4-gke.1
Creating cluster readiness-probe-test-cluster-latest...done.
Created [https://container.googleapis.com/v1/projects/operation-covfefe-1/zones/us-central1-a/clusters/readiness-probe-test-cluster-latest].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1-a/readiness-probe-test-cluster-latest?project=operation-covfefe-1
kubeconfig entry generated for readiness-probe-test-cluster-latest.
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
readiness-probe-test-cluster-latest us-central1-a 1.9.4-gke.1 35.192.183.106 n1-standard-2 1.9.4-gke.1 1 RUNNING
I targeted the cluster via gcloud (which configures kubectl under the hood):
$ gcloud container clusters get-credentials readiness-probe-test-cluster-latest --zone us-central1-a
Fetching cluster endpoint and auth data.
kubeconfig entry generated for readiness-probe-test-cluster-latest.
I set up watches for pods (in two separate terminal windows) via:
$ watch -n 1 kubectl get pods
No resources found.
$ watch -n 1 kubectl describe pods
I deployed the deployment via:
$ kubectl create -f deployment.yaml
deployment "nginx" created
I observe that the pod goes into `Running` for `STATUS` very quickly:
NAME READY STATUS RESTARTS AGE
nginx-6c7794546b-2dwf5 0/1 Running 0 7s
The image is pulled and the container started very quickly (under 15 seconds at the slowest):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11s default-scheduler Successfully assigned nginx-6c7794546b-2dwf5 to gke-readiness-probe-test-default-pool-0f78aa38-67j1
Normal SuccessfulMountVolume 11s kubelet, gke-readiness-probe-test-default-pool-0f78aa38-67j1 MountVolume.SetUp succeeded for volume "default-token-vqxhc"
Normal Pulling 10s kubelet, gke-readiness-probe-test-default-pool-0f78aa38-67j1 pulling image "nginx"
Normal Pulled 6s kubelet, gke-readiness-probe-test-default-pool-0f78aa38-67j1 Successfully pulled image "nginx"
Normal Created 6s kubelet, gke-readiness-probe-test-default-pool-0f78aa38-67j1 Created container
Normal Started 6s kubelet, gke-readiness-probe-test-default-pool-0f78aa38-67j1 Started container
But the pod does not go to `1/1` for `READY` until at least a minute later. Observed times range from about 1 minute to 5 minutes; the pod very rarely reaches `1/1` in under a minute.
NAME READY STATUS RESTARTS AGE
nginx-6c7794546b-2dwf5 1/1 Running 0 1m
Anything else we need to know?:
- A less visual way to observe the issue is to create the deployment and then immediately `time` a `rollout status` command:
$ kubectl create -f deployment.yaml
deployment "nginx" created
$ time kubectl rollout status deploy/nginx
Waiting for rollout to finish: 0 of 1 updated replicas are available...
deployment "nginx" successfully rolled out
real 4m7.364s
user 0m0.174s
sys 0m0.027s
- Timings from a few other reproduction runs of `time kubectl rollout status` as above:
real 4m5.264s
user 0m0.142s
sys 0m0.026s
real 3m54.862s
user 0m0.168s
sys 0m0.027s
real 1m3.162s
user 0m0.154s
sys 0m0.024s
real 2m52.104s
user 0m0.144s
sys 0m0.024s
- I specifically picked a readiness probe command that returns almost immediately (`true`) for reproduction purposes, and gave it a generously large `timeoutSeconds` value of 30 for the same reason. The `periodSeconds` value of 300 emulates our real-life use case, where the readiness probe is more costly in terms of resources and rate limiting.
Environment:
- Kubernetes version (use `kubectl version`):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:34:11Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9+", GitVersion:"v1.9.4-gke.1", GitCommit:"10e47a740d0036a4964280bd663c8500da58e3aa", GitTreeState:"clean", BuildDate:"2018-03-13T18:00:36Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Google Kubernetes Engine
- OS (e.g. from /etc/os-release): COS
- Kernel (e.g. `uname -a`): Linux nginx-6c7794546b-wb7dl 4.4.111+ #1 SMP Thu Feb 1 22:06:37 PST 2018 x86_64 GNU/Linux
Activity
RochesterinNYC commented on Apr 2, 2018
I have noted that the issue does not seem to occur when `periodSeconds` is not specified. If I create the above deployment without the `periodSeconds` key/value for the `readinessProbe`, `time`-ing the `kubectl rollout status deploy/nginx` immediately after `kubectl create -f deployment.yaml` results in timings that seem more fitting/expected.
Title changed from "Odd timing behavior with Readiness Probes with `timeoutSeconds` and `initialDelaySeconds`" to "Odd timing behavior with Readiness Probes with `initialDelaySeconds` and `periodSeconds`".
ghost commented on Apr 2, 2018
I am also experiencing the same issue, and have been able to successfully reproduce it with @RochesterinNYC's minimal deployment manifest. With `periodSeconds` set to 300, my replicaset took over a minute to reach `1/1`, yet without `periodSeconds` specified, my replicaset reached `1/1` in under 5 seconds.
dims commented on Apr 4, 2018
/sig node
yguo0905 commented on Apr 24, 2018
/assign
mattnworb commented on Apr 26, 2018
Seems likely that this is caused by the jitter/sleep that the kubelet does before sending the first probe:
kubernetes/pkg/kubelet/prober/worker.go, line 118 at b8ab289
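To see how that sleep could push readiness out by minutes, here is a rough, self-contained simulation assuming (as discussed later in this thread) that probe attempts happen at jitter, jitter + period, jitter + 2*period, ... and that attempts landing inside `initialDelaySeconds` are skipped; `firstProbeTime` is an illustrative helper, not kubelet code:

```go
// Rough simulation of when the first readiness probe can actually run,
// assuming (a) the worker sleeps a random fraction of periodSeconds before
// its first attempt and (b) attempts that land inside initialDelaySeconds
// are skipped rather than rescheduled. Names are illustrative, not kubelet API.
package main

import (
	"fmt"
	"math/rand"
)

// firstProbeTime returns the time (seconds after container start) of the
// first probe attempt that is not skipped by the initial-delay check.
func firstProbeTime(initialDelay, period, jitter float64) float64 {
	t := jitter // attempts occur at jitter, jitter+period, jitter+2*period, ...
	for t < initialDelay {
		t += period
	}
	return t
}

func main() {
	const initialDelay, period = 10.0, 300.0 // values from the repro above
	worst, best := 0.0, period*2
	for i := 0; i < 100000; i++ {
		t := firstProbeTime(initialDelay, period, rand.Float64()*period)
		if t > worst {
			worst = t
		}
		if t < best {
			best = t
		}
	}
	fmt.Printf("first real probe lands between ~%.0fs and ~%.0fs after container start\n", best, worst)
}
```

With the repro values (`initialDelaySeconds: 10`, `periodSeconds: 300`) this puts the first real probe anywhere from about 10 seconds to roughly 5 minutes after container start, which matches the rollout timings reported above.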
RochesterinNYC commented on Apr 27, 2018
Noting also that the impetus for this issue is the described/intended behavior of readiness/liveness probes as documented at https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/
If there is something else that should be happening, or the behavior observed in this issue is actually "expected" when these options are configured together, then that should potentially be called out in the documentation.
fejta-bot commented on Jul 26, 2018
Issues go stale after 90d of inactivity.
Mark the issue as fresh with `/remove-lifecycle stale`.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with `/close`.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
fejta-bot commented on Aug 25, 2018
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with `/remove-lifecycle rotten`.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with `/close`.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
georgekaz commented on Apr 11, 2019
I don't think this should have been left to go stale, because this issue is real and goes against how the probes are described in the docs. We have a probe that, due to API limits, can only run every 600s, but that should not mean the first probe takes up to 10m to happen after a pod first starts. I understand the logic of not wanting all probes to happen at once if the kubelet crashes, but in the case of a single pod creation it should not be a problem.
ankrause commented on May 31, 2019
/reopen
/remove-lifecycle rotten
I think this should be re-opened, as I was about to create a new issue with essentially the same problem.
I was expecting readiness probes (and presumably liveness probes as well) to execute the first probe immediately upon `initialDelaySeconds` expiring. Instead, it may take a substantial amount of time after the initial wait expires before the pod is marked "Ready". The suggestion about the jitter calculation above looks promising, as my results from experimenting with shorter/longer values for `periodSeconds` were very confusing. The more I tested this the more confused I got. Using a simple `ls` command, an initial delay of 30s, and a period of 200s, my pod entered the ready state approximately 2m30s after the container start event. Conversely, an initial delay of 0 with a wait of 3600 essentially never becomes ready, despite the command being guaranteed to succeed.
All of the examples on the Configure Liveness and Readiness Probes page use an initial delay shorter than the period, and generally short values across the board. Text blocks like this leave it semi-ambiguous as to whether the probe is executed immediately at that 5 second mark or has some built-in delay (regardless of whether that delay is based on `periodSeconds`).
k8s-ci-robot commented on May 31, 2019
@ankrause: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
mzmudziak commented on Jul 5, 2019
I am also running into this issue; the `readinessProbe` doesn't start for quite a long time (longer than the `initialDelaySeconds`) if the `periodSeconds` is set to a higher value.
How exactly is this issue closed already?
initialDelaySeconds and periodSeconds #80431
tmmorin commented on Oct 15, 2020
/reopen
/remove-lifecycle rotten
tmmorin commented on Oct 15, 2020
It seems to me that at least one of the following would be needed:
a) update the documentation to warn users about using an `initialDelaySeconds` lower than `periodSeconds`, and to state that all timers are only honored approximately because a random delay of up to `periodSeconds` is added
b) update the kubelet code to use a lower jitter, bounded by `initialDelaySeconds`, for the first probe (see the sketch after this list)
c) prevent users from using an `initialDelaySeconds` lower than `periodSeconds`
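A minimal sketch of what option (b) might look like, under the assumption that the worker's start-up jitter is what delays the first probe; the `probeSpec` type and `startupJitter` helper below are illustrative stand-ins, not the kubelet's real types, and this is not a proposed patch:

```go
// Sketch of option (b): bound the worker's start-up jitter by
// initialDelaySeconds rather than periodSeconds, so the first probe cannot
// drift far past the configured initial delay. Types here are illustrative.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

type probeSpec struct {
	InitialDelaySeconds int32
	PeriodSeconds       int32
}

// startupJitter keeps the anti-thundering-herd sleep, but caps it at the
// configured initial delay when that delay is shorter than the period.
func startupJitter(spec probeSpec) time.Duration {
	bound := time.Duration(spec.PeriodSeconds) * time.Second
	if d := time.Duration(spec.InitialDelaySeconds) * time.Second; d > 0 && d < bound {
		bound = d
	}
	return time.Duration(rand.Float64() * float64(bound))
}

func main() {
	spec := probeSpec{InitialDelaySeconds: 10, PeriodSeconds: 300}
	fmt.Println("sleep before first probe attempt:", startupJitter(spec))
}
```

Capping the jitter this way would keep the protection against probe storms on kubelet restart while keeping the first probe close to `initialDelaySeconds`.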
JamesTGrant commented on Jan 8, 2021
/reopen
/remove-lifecycle rotten
TL;DR:
Please could I suggest to Kubernetes that
a) the documentation is updated, and
b) the sleep function is possibly modified to constrain the variability to a smaller, hardcoded amount independent of the readiness period (a random portion of 5 sec should be enough for anyone, right...?),
or
c) the 'variability' in the readiness probe sleep is exposed so that, for time-sensitive readiness gates, a human can set it to zero?
Detail:
I've spent this week looking at instances where we sometimes see a statefulset fail to cluster properly on initial deployment in our wider application.
• We have a statefulset with 3 instances which deploy one after the other. We use Parallel rather than OrderedReady to remove dependencies on lower-numbered instances. We use an init container to 'sleep'; the amount of sleep is based on the pod instance name: instance 0 sleeps 0 seconds, instance 1 sleeps n1 seconds, instance 2 sleeps n2 seconds.
• When the statefulset deploys, all pods appear in the endpoint list as notReady, and at the point where a pod passes its readiness probe it is marked ready and enters service rotation (more precise terminology for ‘service rotation’ later – just go with it!).
• Peer discovery in the application pod doesn't actually care whether endpoints are ready or not. When application peer discovery occurs it gets the list of endpoints and tries to talk to everything in the list, whether or not those endpoints are ready. However, you can't talk to a peer pod that is notReady, because it's not in service rotation yet, even if you ask for it using the pod’s FQDN (and even if the application running in the pod was capable/ready at the point a peer wanted to talk).
• When instance 0 deploys there are 2 peers but none will talk to it (they are not in service rotation, and the application pods are not even deployed yet!), so the application's attempt to talk to the peers fails and instance 0 forms a cluster on its own.
• When instance 0 deploys there is a period of time before it is 'ready' according to the readiness probe.
• When instance 1 deploys it sees there are 2 peers. Instance 2 is not yet deployed (its container has not even started yet) but the instance 0 pod is running. At this point the instance 0 pod must be running and ready, otherwise the peering will fail and instance 1 will form a cluster all by itself – which is bad. So we must ensure that the readiness probe passes for instance 0 before instance 1 attempts to peer. I've looked carefully at the application source code (it happens to be RabbitMQ in this case) - during the initial peering process, if it attempts to talk to a peer in the peer list and it isn't possible to establish a TCP connection, then it fails immediately. No timeouts, no retries. There are lots of reasons why it may not be possible to establish a connection (DNS outage, etc.), but if the peer isn't in service rotation (is not ready) then it'll certainly fail.
• When instance 2 deploys it sees there are 2 peers. By this time it's pretty much certain that instance 0 has entered the ready state, so 2 can peer with 0. It’s unlikely that 1 is ready yet, so it’ll peer with instance 0. Sometimes, if instance 1 is ready, it may pick instance 1 rather than 0.
End result:
0 and 2 are co-clustered and 1 is in its own cluster - the Rabbit cluster is partitioned and will never 'heal' at this point, because instance 1 thinks it's doing the right thing and there's nothing to heal, and instances 0 and 2 were never aware of 1.
Or
1 and 2 are co-clustered and 0 is in its own cluster - the Rabbit cluster is partitioned and will never 'heal' at this point, because instance 0 thinks it's doing the right thing and there's nothing to heal, and instances 1 and 2 were never trying to cluster with 0.
So - why, despite our careful initContainer delays and careful Readiness probe delay setting, do we see issues with application cluster formation?
In the k8s kubelet source code there is this little bit of code before the first readiness probe is sent after starting a pod:
kubernetes/pkg/kubelet/prober/worker.go, line 113 at b8ab289
```go
// run periodically probes the container.
func (w *worker) run() {
	probeTickerPeriod := time.Duration(w.spec.PeriodSeconds) * time.Second

	// If kubelet restarted the probes could be started in rapid succession.
	// Let the worker wait for a random portion of tickerPeriod before probing.
	time.Sleep(time.Duration(rand.Float64() * float64(probeTickerPeriod)))

	probeTicker := time.NewTicker(probeTickerPeriod)
```
FYI: rand.Float64 returns, as a float64, a pseudo-random number in the half-open interval [0.0, 1.0).
You can see it's not just the initial delay - it's the initial delay plus a random portion of the probe period (30 sec in our case).
In our use today we have an initialDelay of 35 sec (which is a measured amount of time that it takes after starting the application pod before the application should (comfortably) be ready) and a readiness period of 30 sec. So the 1st readiness probe is delayed by a minimum of 35 sec plus up to an additional 30 sec (a potential total of 65 sec), but an initContainer sleep of only 45 sec means instance 1 is only 45 sec behind instance 0 and so will attempt to peer before instance 0 is marked ready.
This easily causes the situation I describe above where instance 1 wants to peer but the kubelet hasn't yet sent a readiness probe to instance 0, so instance 0 is notReady and not in service rotation, and instance 1 cannot talk to instance 0 even though the application in instance 0 is totally capable at this point. This silly (undocumented) k8s 'feature' is the root cause of our problem.
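To make the arithmetic above concrete, here is a small worked example under the same jitter-plus-skip reading of worker.go (attempts at jitter, jitter + period, ..., with attempts before `initialDelaySeconds` skipped); the numbers are the ones from this comment:

```go
// Worked numbers for the scenario above: initialDelaySeconds=35,
// periodSeconds=30, jitter drawn from [0, 30). The first attempt at or after
// 35s is the one that actually runs, so readiness lands somewhere in [35s, 65s).
// This follows the jitter/skip reading of worker.go discussed in this thread.
package main

import "fmt"

func main() {
	const initialDelay, period = 35.0, 30.0
	for _, jitter := range []float64{0, 4, 10, 29} {
		t := jitter
		for t < initialDelay {
			t += period // skipped attempts just wait for the next tick
		}
		fmt.Printf("jitter=%2.0fs -> first real probe at %2.0fs\n", jitter, t)
	}
}
```

With an initContainer sleep of only 45 sec, instance 1 can therefore easily start peering while instance 0's first probe is still pending.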
Now that we understand this, we can set:
which allows us to set the initContainer sleep big enough to cover this gap, which can be 65 sec in total (I am testing with 35 sec + 30 sec + 5 sec (5 sec for safety!), which gives 70 sec). We could also reduce the period to reduce the variability, but I suggest we shouldn't, as we don't want to increase load on any of the kubelets.
Apart from actually making the thing work more reliably, these changes don't change the time taken for the RabbitMQ cluster to become available to clients much, if at all. The cluster is 'up' once the 1st instance is running and marked ready (and the application is actually ready). So the 1st instance will be ready 35 to 65 sec after pod deployment. The defaults of the (Bitnami) chart are actually sometimes worse, as you often hit a situation where the 1st readiness probe fails and then you have to wait another 30 sec before the next readiness probe (10+30+30).
Please could I suggest to Kubernetes that the documentation is updated, and that the sleep function is possibly modified to constrain the variability to a smaller, hardcoded amount independent of the readiness period (a random portion of 5 sec should be enough for anyone, right...?), or that the variability value is exposed so that, for time-sensitive readiness gates, a human can set it to zero?
Clarification about services, DNS records and ready probes:
"Headless" (without a cluster IP) Services are also assigned a DNS A or AAAA record, depending on the IP family of the service, for a name of the form my-svc.my-namespace.svc.cluster-domain.example. Unlike normal Services, this resolves to the set of IPs of the pods selected by the Service. Clients are expected to consume the set or else use standard round-robin selection from the set.
Because A or AAAA records are not created for Pod names, hostname is required for the Pod’s A or AAAA record to be created. A Pod with no hostname but with subdomain will only create the A or AAAA record for the headless service ( default-subdomain.my-namespace.svc.cluster-domain.example ), pointing to the Pod’s IP address. Also, Pod needs to become ready in order to have a record unless publishNotReadyAddresses=True is set on the Service.
We do not use publishNotReadyAddresses in our RabbitMQ services (neither the headless nor the non-headless one). I don't think we want to use publishNotReadyAddresses, because we want ready to be meaningful. I want to prevent the situation where the RabbitMQ pod is in its preflight/boot/initialisation phase (where the application is running but it's doing things like validating env vars and other environment/runtime-shaping activities prior to actually starting with the expected args/configs) and something talks to it in that phase. After that phase, the RabbitMQ application gets stopped and later (a second or so later) gets restarted (which I think is a Bitnami behaviour rather than purely a RabbitMQ application requirement). My brain hurts thinking about what the resultant cluster would look like if you talked to the RabbitMQ application in the preflight/boot/initialisation stage.
k8s-ci-robot commented on Jan 8, 2021
@JamesTGrant: You can't reopen an issue/PR unless you authored it or you are a collaborator.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
juniorgasparotto commented on Sep 3, 2021
Same problem here.
ffetech commented on Jul 20, 2024
Same problem.
It's not possible to get a fast startup while also using a slower interval for health checks. What's the reason for this restriction, and why is a value for initialDelaySeconds effectively ignored if it's lower than the value for periodSeconds?
Is there any workaround?
dbaumgarten commented on Sep 20, 2024
The issue is already 6 years old and hasn't been addressed at all.
It should (at the very least) be reopened.
Kashemir001 commented on Mar 21, 2025
May be related to kubernetes/website#48519
TL;DR (still under investigation): `initialDelaySeconds` is not actually a delay before probing starts; it's a time window during which probes are skipped, while the `periodSeconds` timer starts ticking from the moment the container has started.
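If that reading is right, the behaviour could be sketched like this; `skipProbe` and its parameters are illustrative, not the kubelet's actual function:

```go
// Minimal sketch of the behaviour described above: initialDelaySeconds acts
// as a skip window checked on each probe tick, not as a one-off delay before
// the first probe. Names and types are illustrative.
package main

import (
	"fmt"
	"time"
)

// skipProbe reports whether a probe tick should be skipped because the
// container has not yet been running for initialDelaySeconds.
func skipProbe(containerStartedAt time.Time, initialDelaySeconds int32, now time.Time) bool {
	return int32(now.Sub(containerStartedAt).Seconds()) < initialDelaySeconds
}

func main() {
	started := time.Now()
	// With initialDelaySeconds=10 and a tick landing 5s after start, the tick
	// is skipped; the next chance to probe is a full period later.
	fmt.Println("tick at +5s skipped: ", skipProbe(started, 10, started.Add(5*time.Second)))
	fmt.Println("tick at +15s skipped:", skipProbe(started, 10, started.Add(15*time.Second)))
}
```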