
Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds #62036

Closed
RochesterinNYC opened this issue Apr 2, 2018 · 19 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node.


@RochesterinNYC

/kind bug

What happened:

A deployment with a readiness probe configured with a command guaranteed to succeed instantly with exit code 0 (the command true), an initialDelaySeconds of 10, and a periodSeconds of 300 has its pod go READY only after the pod has already been marked Running in STATUS for at least a minute.

What you expected to happen:

The deployment configured as above should have its readiness probe execute roughly 10 seconds after the pod goes into Running STATUS, with the pod becoming 1/1 READY right after. I would not expect it to take 1-5 minutes for the pod to become READY after the container/pod has already been marked Running in STATUS.

How to reproduce it (as minimally and precisely as possible):

I was able to reproduce it with the following minimal deployment.yaml:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        readinessProbe:
          exec:
            command:
            - 'true'
          periodSeconds: 300
          initialDelaySeconds: 10
          timeoutSeconds: 30

I created a fresh cluster on GKE (version is the latest available for GKE: 1.9.4-gke.1) in the us-central1-a zone with one n1-standard-2 node:

$ gcloud container clusters create readiness-probe-test-cluster-latest --num-nodes=1 --machine-type n1-standard-2 --zone us-central1-a --cluster-version=1.9.4-gke.1
Creating cluster readiness-probe-test-cluster-latest...done.
Created [https://container.googleapis.com/v1/projects/operation-covfefe-1/zones/us-central1-a/clusters/readiness-probe-test-cluster-latest].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1-a/readiness-probe-test-cluster-latest?project=operation-covfefe-1
kubeconfig entry generated for readiness-probe-test-cluster-latest.
NAME                                 LOCATION       MASTER_VERSION  MASTER_IP       MACHINE_TYPE   NODE_VERSION  NUM_NODES  STATUS
readiness-probe-test-cluster-latest  us-central1-a  1.9.4-gke.1     35.192.183.106  n1-standard-2  1.9.4-gke.1   1          RUNNING

I targeted the cluster via gcloud (which configures kubectl under the hood):

$ gcloud container clusters get-credentials readiness-probe-test-cluster-latest --zone us-central1-a
Fetching cluster endpoint and auth data.
kubeconfig entry generated for readiness-probe-test-cluster-latest.

I set up watches on the pods (in two separate terminal windows):

$ watch -n 1 kubectl get pods

No resources found.
$ watch -n 1 kubectl describe pods

I deployed the deployment via:

$ kubectl create -f deployment.yaml
deployment "nginx" created

I observe that the pod goes into Running STATUS very quickly:

NAME                     READY     STATUS    RESTARTS   AGE
nginx-6c7794546b-2dwf5   0/1       Running   0          7s

The image is pulled and the container started very quickly (under 15 seconds at the slowest):

Events:
  Type    Reason                 Age   From                                                          Message
  ----    ------                 ----  ----                                                          -------
  Normal  Scheduled              11s   default-scheduler                                             Successfully assigned nginx-6c7794546b-2dwf5 to gke-readiness-probe-test-default-pool-0f78aa38-67j1
  Normal  SuccessfulMountVolume  11s   kubelet, gke-readiness-probe-test-default-pool-0f78aa38-67j1  MountVolume.SetUp succeeded for volume "default-token-vqxhc"
  Normal  Pulling                10s   kubelet, gke-readiness-probe-test-default-pool-0f78aa38-67j1  pulling image "nginx"
  Normal  Pulled                 6s    kubelet, gke-readiness-probe-test-default-pool-0f78aa38-67j1  Successfully pulled image "nginx"
  Normal  Created                6s    kubelet, gke-readiness-probe-test-default-pool-0f78aa38-67j1  Created container
  Normal  Started                6s    kubelet, gke-readiness-probe-test-default-pool-0f78aa38-67j1  Started container

But the pod does not go 1/1 READY until at least a minute later. Observed times range from about 1 minute to 5 minutes; the pod very rarely becomes 1/1 in under a minute.

NAME                     READY     STATUS    RESTARTS   AGE
nginx-6c7794546b-2dwf5   1/1       Running   0          1m

Anything else we need to know?:

  • A less visual way to observe the issue is to create the deployment and then immediately time a rollout status command:
$ kubectl create -f deployment.yaml
deployment "nginx" created
$ time kubectl rollout status deploy/nginx
Waiting for rollout to finish: 0 of 1 updated replicas are available...
deployment "nginx" successfully rolled out

real	4m7.364s
user	0m0.174s
sys	0m0.027s
  • Timings from a few other reproduction runs, timing kubectl rollout status as above:
real	4m5.264s
user	0m0.142s
sys	0m0.026s
real	3m54.862s
user	0m0.168s
sys	0m0.027s
real	1m3.162s
user	0m0.154s
sys	0m0.024s
real	2m52.104s
user	0m0.144s
sys	0m0.024s
  • I specifically picked a readiness probe command (true) that returns almost immediately, for reproduction purposes, and gave it an egregiously large timeoutSeconds of 30 for the same reason. The periodSeconds of 300 emulates our real-life use case, where the readiness probe is more costly in terms of resources and rate limits.

Environment:

  • Kubernetes version (use kubectl version):
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.6", GitCommit:"6260bb08c46c31eea6cb538b34a9ceb3e406689c", GitTreeState:"clean", BuildDate:"2017-12-21T06:34:11Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9+", GitVersion:"v1.9.4-gke.1", GitCommit:"10e47a740d0036a4964280bd663c8500da58e3aa", GitTreeState:"clean", BuildDate:"2018-03-13T18:00:36Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Google Kubernetes Engine (GKE)
  • OS (e.g. from /etc/os-release): COS
  • Kernel (e.g. uname -a): Linux nginx-6c7794546b-wb7dl 4.4.111+ #1 SMP Thu Feb 1 22:06:37 PST 2018 x86_64 GNU/Linux
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Apr 2, 2018
@RochesterinNYC
Author

I have noted that the issue does not seem to occur when periodSeconds is not specified.

If I create the above deployment without the periodSeconds key for the readinessProbe, timing kubectl rollout status deploy/nginx immediately after kubectl create -f deployment.yaml gives timings that look as expected:

real	0m18.177s
user	0m0.147s
sys	0m0.021s
real	0m17.621s
user	0m0.134s
sys	0m0.022s
real	0m16.340s
user	0m0.154s
sys	0m0.021s

@RochesterinNYC RochesterinNYC changed the title Odd timing behavior with Readiness Probes with timeoutSeconds and initialDelaySeconds Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds Apr 2, 2018
@ghost

ghost commented Apr 2, 2018

I am also experiencing the same issue, and have been able to successfully reproduce it with @RochesterinNYC's minimal deployment manifest. With periodSeconds set to 300, my replicaset took over a minute to reach 1/1, yet without periodSeconds specified, my replicaset reached 1/1 in under 5 seconds.

@dims
Member

dims commented Apr 4, 2018

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 4, 2018
@yguo0905
Contributor

/assign

@mattnworb
Contributor

Seems likely that this is caused by the jitter/sleep that kubelet does before sending the first probe:

time.Sleep(time.Duration(rand.Float64() * float64(probeTickerPeriod)))
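
If that is the cause, here is a minimal Go sketch of the resulting first-probe timing. It is a model of the behavior, not kubelet code; it assumes the worker probes right after this sleep and skips any attempt that lands before initialDelaySeconds has elapsed since container start:

package main

import (
    "fmt"
    "math/rand"
)

// firstEffectiveProbe models when the first probe that actually runs can
// fire: the worker sleeps a random fraction of the period, then probes on
// every tick, but attempts made before initialDelay are skipped.
// All times are seconds since container start.
func firstEffectiveProbe(initialDelay, period float64) float64 {
    t := rand.Float64() * period // the startup jitter quoted above
    for t < initialDelay {
        t += period // attempt skipped; next chance is a full period later
    }
    return t
}

func main() {
    for i := 0; i < 5; i++ {
        fmt.Printf("run %d: first probe ~%.0fs after container start\n",
            i+1, firstEffectiveProbe(10, 300))
    }
}

With initialDelaySeconds: 10 and periodSeconds: 300, that puts the first probe anywhere in roughly the 10s-310s window, which would explain the 1-5 minute readiness times reported above.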

@RochesterinNYC
Author

Noting also that the impetus for this issue is the described/intended behavior of readiness and liveness probes as documented at https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

Probes have a number of fields that you can use to more precisely control the behavior of liveness and readiness checks:

- initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated.
- periodSeconds: How often (in seconds) to perform the probe. Default to 10 seconds. Minimum value is 1.
- timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.

If there is something else that should be happening or the behavior observed in this issue is actually "expected" when these options are configured together, then that should potentially be called out in the documentation.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 26, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 25, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.


@georgekaz

I don't think this should have been left to go stale, because the issue is real and goes against how the probes are described in the docs. We have a probe that, due to API limits, can only run every 600s, but that should not mean the first probe can take up to 10 minutes after a pod first starts. I understand the logic of not wanting all probes to fire at once if the kubelet crashes, but in the case of a single pod creation it should not be a problem.

@ankrause

/reopen
/remove-lifecycle rotten

I think this should be re-opened as I was about to create a new issue with essentially the same problem.

I was expecting readiness probes (and presumably liveness probes as well) to execute the first probe immediately once initialDelaySeconds expires. Instead, it may take a substantial amount of time after the initial wait expires before the pod is marked "Ready". The suggestion above about the jitter calculation looks promising, as my experiments with shorter and longer periodSeconds values gave very confusing results; the more I tested, the more confused I got. Using a simple ls command, an initial delay of 30s, and a period of 200s, my pod entered the ready state approximately 2m30s after the container start event. Conversely, an initial delay of 0 with a period of 3600 essentially never becomes ready, despite the command being guaranteed to succeed.

All of the examples on the Configure Liveness and Readiness Probes page use an initial delay shorter than period seconds, and generally short values across the board. Text blocks like this:

In the configuration file, you can see that the Pod has a single Container. The periodSeconds field specifies that the kubelet should perform a liveness probe every 5 seconds. The initialDelaySeconds field tells the kubelet that it should wait 5 second before performing the first probe.

leave it semi-ambiguous as to whether the probe is executed immediately at that 5 second mark or has some built-in delay (regardless of whether that delay is based on periodSeconds).
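
For what it's worth, plugging those two configurations into the jitter model sketched earlier gives numbers consistent with these observations (back-of-the-envelope arithmetic, not kubelet output):

package main

import "fmt"

// Under the jitter model, the first probe lands roughly between
// initialDelay and initialDelay+period seconds after container start.
func main() {
    configs := []struct {
        name          string
        delay, period float64
    }{
        {"ls probe, delay 30s, period 200s", 30, 200},
        {"delay 0s, period 3600s", 0, 3600},
    }
    for _, c := range configs {
        fmt.Printf("%s: first probe in [%.0fs, %.0fs), mean ~%.0fs\n",
            c.name, c.delay, c.delay+c.period, c.delay+c.period/2)
    }
}

The first case averages out around 2m10s, close to the ~2m30s observed; the second is uniform over an hour, i.e. "essentially never" in practice.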

@k8s-ci-robot
Contributor

@ankrause: You can't reopen an issue/PR unless you authored it or you are a collaborator.


@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 31, 2019
@mzmudziak

mzmudziak commented Jul 5, 2019

I am also running into this issue: the readinessProbe doesn't start for quite a long time (longer than initialDelaySeconds) when periodSeconds is set to a higher value.
How exactly is this issue already closed?

@tmmorin

tmmorin commented Oct 15, 2020

/reopen
/remove-lifecycle rotten

@tmmorin

tmmorin commented Oct 15, 2020

It seems to me that one of the following is needed:
a) update the documentation to warn users about using an initialDelaySeconds lower than periodSeconds, and to note that all timers are honored only approximately, because a random delay of up to periodSeconds is added
b) update the kubelet code to use a smaller jitter, bounded by initialDelaySeconds, for the first probe (see the sketch after this list)
c) prevent users from setting an initialDelaySeconds lower than periodSeconds
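
A minimal sketch of what option (b) could look like, assuming the jitter only needs to spread probes out after a kubelet restart; the function name and bounding rule are illustrative, not actual kubelet code:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

// boundedStartupJitter caps the worker's random startup sleep at
// initialDelaySeconds (when one is set) instead of the full period,
// so the first probe cannot drift far past the configured delay.
func boundedStartupJitter(initialDelay, period time.Duration) time.Duration {
    bound := period
    if initialDelay > 0 && initialDelay < bound {
        bound = initialDelay
    }
    return time.Duration(rand.Float64() * float64(bound))
}

func main() {
    // With delay 10s and period 300s, the sleep now stays under 10s.
    fmt.Println(boundedStartupJitter(10*time.Second, 300*time.Second))
}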

@JamesTGrant

/reopen
/remove-lifecycle rotten

TL;DR:
Please could I suggest to Kubernetes that
a) the documentation is addressed, and
b) the sleep function is possibly modified to constrain the variability to a smaller, hardcoded amount independent of readinessPeriod (a random portion of 5sec should be enough for anyone, right...?)

or
c) the 'variability' in the readiness probe sleep is exposed so that, for time-sensitive readiness gates, a human can set it to zero?

Detail:
I've spent this week looking at instances where we sometimes see a statefulset fail to cluster properly on initial deployment in our wider application.
• We have a statefulset with 3 instances which deploy one after the other. We use Parallel rather than OrderedReady to remove dependencies on lower-numbered instances. We use an init container to 'sleep'; the amount of sleep is based on the pod instance name: instance 0 sleeps 0 seconds, instance 1 sleeps n1 seconds, instance 2 sleeps n2 seconds.
• When the statefulset deploys, all pods appear in the endpoint list as notReady, and at the point where a pod passes its readiness probe it is marked ready and then enters service rotation (more/better precise terminology for ‘service rotation’ later – just go with it!).
• Peer discovery in the application pod doesn't actually care whether endpoints are ready or not. When application peer discovery occurs it gets the list of endpoints and tries to talk to everything in the list, ready or not. However, you can't talk to a peer pod that is notReady, because it's not in service rotation yet, even if you ask for it using the pod’s FQDN (and even if the application running in the pod was capable/ready at the point a peer wanted to talk).
• When instance 0 deploys there are 2 peers but none will talk to it (they are not in service rotation, and the application pods are not even deployed yet!), so the application's attempt to chat to the peers fails and instance 0 forms a cluster on its own.
• When instance 0 deploys there is a period of time before it is 'ready' according to the readiness probe.
• When instance 1 deploys it sees there are 2 peers. Instance 2 is not yet deployed (its container has not even started) but the instance 0 pod is running. At this point the instance 0 pod must be running and ready, otherwise the peering will fail and instance 1 will form a cluster all by itself – which is bad. So we must ensure that the readiness probe passes for instance 0 before instance 1 attempts to peer. I've looked carefully at the application source code (it happens to be RabbitMQ in this case): during the initial peering process, if it attempts to talk to a peer in the peer list and it isn't possible to establish a TCP connection, it fails immediately. No timeouts, no retries. There are lots of reasons why it may not be possible to establish a connection (DNS outage etc.), but if the peer isn't in service rotation (is not ready) then it'll certainly fail.
• When instance 2 deploys it sees there are 2 peers. By this time it's pretty much certain that instance 0 has entered the ready state, so 2 can peer with 0. It's unlikely that 1 is ready yet, so it'll peer with instance 0. Sometimes, if instance 1 is ready, it may pick instance 1 rather than 0.
End result:
0 and 2 are co-clustered and 1 is in its own cluster – the Rabbit cluster is partitioned and will never 'heal' at this point, because instance 1 thinks it's doing the right thing and there's nothing to heal, and instances 0 and 2 were never aware of 1.
Or
1 and 2 are co-clustered and 0 is in its own cluster – the Rabbit cluster is partitioned and will never 'heal' at this point, because instance 0 thinks it's doing the right thing and there's nothing to heal, and instances 1 and 2 were never trying to cluster with 0.

So - why, despite our careful initContainer delays and careful readiness probe delay settings, do we see issues with application cluster formation?
In the k8s kubelet source code there is this little bit of code that runs before the first readiness probe is sent after a pod starts:

// run periodically probes the container.
func (w *worker) run() {
    probeTickerPeriod := time.Duration(w.spec.PeriodSeconds) * time.Second

    // If kubelet restarted the probes could be started in rapid succession.
    // Let the worker wait for a random portion of tickerPeriod before probing.
    time.Sleep(time.Duration(rand.Float64() * float64(probeTickerPeriod)))

    probeTicker := time.NewTicker(probeTickerPeriod)
    ...
}

FYI: rand.Float64 returns, as a float64, a pseudo-random number in the half-open interval [0.0, 1.0).

You can see it's not just the initial delay - it's the initial delay plus a random portion of the probe period (30sec in our case).
In our deployment today we have an initialDelay of 35sec (a measured amount of time after starting the application pod by which the application should comfortably be ready) and a readinessPeriod of 30sec. So the 1st readiness probe is delayed by a minimum of 35sec plus up to an additional 30sec (a potential total of 65sec), but the initContainer sleep is only 45sec, so instance 1 is only 45sec behind instance 0 and will attempt to peer before instance 0 is marked ready.
This easily causes the situation described above where instance 1 wants to peer but the kubelet hasn't yet sent a readiness probe to instance 0, so instance 0 is notReady and not in service rotation, and instance 1 cannot talk to instance 0 even though the application in instance 0 is totally capable at this point. This silly (undocumented) k8s 'feature' is the root cause of our problem.
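
To make that arithmetic concrete, here is a sketch using the numbers from this comment, taking the first-ready window as initialDelay to initialDelay+period:

package main

import "fmt"

// Instance 0's first readiness probe lands somewhere in
// [initialDelay, initialDelay+period) after its container starts, while
// instance 1 begins peering after only its init-container sleep.
func main() {
    const (
        initialDelay   = 35.0 // seconds: measured app start-up margin
        period         = 30.0 // seconds: readiness periodSeconds
        initSleepInst1 = 45.0 // seconds: instance 1's init-container sleep
    )
    fmt.Printf("instance 0 marked ready between %.0fs and %.0fs\n",
        initialDelay, initialDelay+period)
    fmt.Printf("instance 1 starts peering at %.0fs, losing the race whenever the jitter exceeds %.0fs\n",
        initSleepInst1, initSleepInst1-initialDelay)
}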

Now we understand this, we can set:

  1. the readinessInitialDelay big enough to ensure the application starts in time to pass the 1st readiness probe. Interestingly, the 3rd-party chart defaults often show a failure of the 1st readiness probe if it is sent close to the readinessInitialDelay (which is 10sec), but not if it is sent toward the end of readinessInitialDelay+readinessPeriod (which is 40sec) - i.e. the default initialDelay of 10sec is too short.
  2. the gap between each statefulset instance deploying the application pod to at least readinessInitialDelay+readinessPeriod, which allows us to set the initContainer sleep big enough to cover this gap of up to 65sec in total (I am testing with 35sec+30sec+5sec (5sec for safety!!), which gives 70sec). We could also reduce the period to reduce the variability, but I suggest we shouldn't, as we don't want to increase load on any of the kubelets.

Apart from actually making the thing work more reliably, these changes don't change the time taken for the RabbitMQ cluster to become available to clients much, if at all. The cluster is 'up' once the 1st instance is running and marked ready (and the application is actually ready). So the 1st instance will be ready 35 to 65sec after pod deployment. The defaults of the (Bitnami) chart are sometimes worse, as you often hit a situation where the 1st readiness probe fails and you then have to wait another 30sec for the next readiness probe (10+30+30).

Please could I suggest to Kubernetes that the documentation is addressed, and possibly that the sleep function is modified to constrain the variability to a smaller, hardcoded amount independent of readinessPeriod (a random portion of 5sec should be enough for anyone, right...?), or that the variability value is exposed so that, for time-sensitive readiness gates, a human can set it to zero?

Clarification about services, DNS records and ready probes:
"Headless" (without a cluster IP) Services are also assigned a DNS A or AAAA record, depending on the IP family of the service, for a name of the form my-svc.my-namespace.svc.cluster-domain.example. Unlike normal Services, this resolves to the set of IPs of the pods selected by the Service. Clients are expected to consume the set or else use standard round-robin selection from the set.

Because A or AAAA records are not created for Pod names, hostname is required for the Pod’s A or AAAA record to be created. A Pod with no hostname but with subdomain will only create the A or AAAA record for the headless service ( default-subdomain.my-namespace.svc.cluster-domain.example ), pointing to the Pod’s IP address. Also, Pod needs to become ready in order to have a record unless publishNotReadyAddresses=True is set on the Service.
We do not use publishNotReadyAddresses in our RabbitMQ Services (neither the headless one nor the normal one). I don't think we want publishNotReadyAddresses, because we want ready to be meaningful. I want to prevent the situation where the RabbitMQ pod is in its preflight/boot/initialisation phase (where the application is running but is doing things like validating env vars and other environment/runtime-shaping activities before really starting with the expected args/configs) and something talks to it in that phase. After that phase, the RabbitMQ application gets stopped and a second or so later restarted (which I think is a Bitnami behaviour rather than purely a RabbitMQ application requirement). My brain hurts thinking about what the resultant cluster would look like if you talked to the RabbitMQ application in the preflight/boot/initialisation stage.

@k8s-ci-robot
Contributor

@JamesTGrant: You can't reopen an issue/PR unless you authored it or you are a collaborator.


@juniorgasparotto

Same problem here.
