LivenessProbe should start after ReadinessProbe Succeeded if ReadinessProbe is specified #27114
I see this is still not implemented. This would really be much appreciated, and I think it's a pretty logical way to think about it. Having the liveness probe kill off the pod before it's ready is frustrating and counter-intuitive. As mentioned, that should be handled completely in the readiness probe, with a set limit on time to ready / check iterations. As it stands now I have to make rough estimates about how long I should delay my liveness probe, which is not always an easy constant to track.
@kubernetes/sig-node-feature-requests
The change is easy to implement; however, what is the correct behavior, given what is done today?
Understandably, this will cause unexpected behaviour because of the way people have bent their liveness probes around this at the moment. I suggest an extra variable on the liveness probe, and maybe a warning somewhere that in future versions (however far down the line) liveness probes will wait by default. Or just stick with the extra variable.
I totally like this idea; it fits my general understanding of what a liveness probe is for. I vote for making this change.
Copying and pasting my own comment from the PR: Changing the interaction of these two checks will surprise many users.
@yujuhong I can see that, and it's a fair understanding. But it doesn't change the fact that there is no easy way to ask your liveness probes to wait until your instances are actually ready, hence causing an early shutdown if you didn't offset it enough. This is not a stable solution at all. In any case, it seems that the ambiguity has caused it to mean different things to different people. I can't think of a time I would consider a pod "alive" but not "ready", so that's why the default behaviour seems strange to me. I guess the default behaviour should remain the same then, and as suggested, an additional variable could opt into the new behaviour, unless there was some other kind of probe that acts as a precursor to both the liveness and readiness probes.
I don't think it's completely unreasonable for a liveness check to not start until a readiness check passes (possibly as an option). But the edge cases can be complicated (such as on kubelet restart), so it needs to be sufficiently justified as a usability and "ease of use" feature to rise above the bar.
Would it be possible to use a toleration/taint approach for the pod to manage the "state" of this lifecycle in a more persistent way and alleviate the kubelet-restart case? I do understand how this could get a bit tricky: we may need to also keep the timestamp of the pod's initial readiness check, and decide what should happen if the pod gets evicted and replaced or restarted, etc. In my opinion, as the StatefulSet paradigm continues to spread, this will become an increasingly prevalent issue. In our case we are attempting to run things like Kafka and Elasticsearch which, in exceptional cases, can take dramatically longer to start up (rebuilding indices and rebalancing) than a clean restart would. We are currently evaluating a few options, but having the ability to indicate that the liveness check's initial delay should only start after the readiness check passes the first time would take a lot of the guesswork out of it for us.
Hmm, what about the scenario where the startup configuration is dynamic and is waiting for a service that doesn't exist anymore? Then you want to restart the configuration step to pick up the new changes, or else it will never boot up.
In that case you could have your readiness probe fail in the first place and just be polling on that external service being discoverable.
Just stumbled upon this problem when rolling out a Deployment resource with a long-ish start-up time. The Deployment has both readiness and liveness probes specified. Because the app takes a long time to start, the failing liveness probe eventually restarts the pod, and thus the Deployment enters a crash/restart loop. This is quite non-intuitive. IMO, at least on Deployment rollout, it would be more logical to start probing for liveness after the Deployment's pod is ready.
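To make the trade-off concrete, here is a minimal sketch of the kind of configuration this comes down to today; the endpoint, port and timings are assumptions, and the point is that initialDelaySeconds on the liveness probe has to be padded to cover the worst-case startup time, or the container is restarted before it ever becomes ready.

```yaml
# Illustrative workaround only: endpoint, port and timings are assumptions.
readinessProbe:
  httpGet:
    path: /healthz        # the pod only receives traffic once this passes
    port: 8080
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 300  # padded to a guessed worst-case startup time
  periodSeconds: 10
  failureThreshold: 3       # after these failures, the container is restarted
```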
I ran into this same issue when first learning K8s. I agree with the previous comments that logically liveness needs to follow readiness.
How about adding a "start before delay if ready" property instead of "start when ready"? The liveness probe would then start either after the initial delay expires or as soon as the pod becomes ready, whichever comes first.
It makes sense that K8s has treated these separately so that the correspondence between a check and its effect is clear. I think we should add an extra parameter for the liveness probe.
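For illustration, the proposals in this thread amount to something like the snippet below; the field name is purely hypothetical and is not part of the Kubernetes API.

```yaml
livenessProbe:
  httpGet:
    path: /healthz         # illustrative endpoint and port
    port: 8080
  periodSeconds: 10
  # Hypothetical flag discussed in this thread, NOT an actual Kubernetes field:
  # only begin liveness checks once the readiness probe has succeeded at least once.
  # waitForReadiness: true
```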
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
No updates on this? Adding the extra parameter would make a lot of sense. We sometimes do database migrations on pod startup (if something big changed), and that can take an unknown amount of time, so it feels a bit unnecessary to always set the initialDelay very high just to avoid the pod getting killed during the migration. If it waited for readiness instead, it would feel much safer. I understand that this is not always the case, so this extra parameter would solve the issue for all parties, I think.
After spending a few hours reading through this issue, all the docs, etc., I'm still finding it annoyingly complex to account for a pod whose first startup, where it configures everything, could take 5-10 minutes, while subsequent pods could take anywhere from under a second to a minute. I want the pods to come up quickly, and also stop serving requests quickly if there's a failure in one of the probes. But to account for the fact that first-time initialization could take 10 minutes, I have to leave padding in both my probes (since I can't have one run before the other kicks in), and either make it so no container is 'ready' until after 10 minutes or so, or accept that containers could be in a bad state (but not marked 'unhealthy') for up to 10 minutes. Like others earlier in this thread, I would love an annotation or configuration option like "livenessBeforeReadiness" so I can use that kind of behavior (which seems more intuitive to me).
For the most part we have been able to solve these kinds of issues with init containers. The init container handles the case where setup takes an unusual amount of time. For the happy path it just exits with a no-op.
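A rough sketch of that pattern, assuming the slow part of startup is waiting on an external dependency; the image, command and dependency URL are placeholders, not anything prescribed in this thread.

```yaml
# Placeholder names throughout. The init container blocks pod startup until the
# dependency answers; on the happy path the loop exits immediately, which is
# effectively a no-op, and only then do the main container and its probes start.
spec:
  initContainers:
    - name: wait-for-dependency
      image: busybox:1.36
      command:
        - sh
        - -c
        - 'until wget -q -O /dev/null http://my-dependency:8080/healthz; do sleep 5; done'
  containers:
    - name: app
      image: my-app:latest
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
```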
@wstrange as has been mentioned, for some of us, init containers aren't the solution, because the software we are dealing with can't be modified in a way that would support them. It's great when you can, but it simply isn't possible in all situations. There are of course workarounds for that, but as @matthyx has observed, it would be much better if they were unnecessary.
Be careful with not executing your probes, as it leaves you blind regarding the state of your container. The main idea behind something like the proposed startupProbe is to keep checking the container during startup without letting the liveness probe kill it.
@matthyx - That's exactly why I think this issue has had so much uptake. The way liveness/readiness probes currently work is quite fragile and/or nerve-wracking, because you can't really optimize for the use case of "pods can take a while to start and be responsive sometimes, and that's okay". I was looking at #71449 and it looks like there is still a decently-long path towards it hitting a stable release; therefore I still have to find ways to work around this shortcoming for probably the next year or two at least (I'm on EKS... still on 1.11 :P). I will definitely look into using an init container in the meantime.
I agree, and that's why I have submitted a talk for the next KubeCon to raise awareness and hopefully get smart minds working on an acceptable solution. You can also try a wrapper for the livenessProbe, as suggested in my previous comment.
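One way to read the wrapper idea, as a sketch rather than the author's exact approach: wrap the real check in an exec probe that passes trivially until the application has written a "started" marker, then runs the real health check. The marker path, port, endpoint, and the presence of a shell and curl in the image are all assumptions.

```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # Pass trivially until the app has written its marker file on first
      # successful startup; after that, run the real health check.
      # /tmp/app-started, the port and the endpoint are placeholders, and this
      # assumes curl exists in the image.
      - '[ ! -f /tmp/app-started ] || curl -fsS http://localhost:8080/healthz'
  periodSeconds: 10
```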
Having an init container handle the expected behavior of a liveness probe (starting its checks only after the readiness probe has passed) seems counter-intuitive. A past implementation with a configured delay before checking starts should not hinder the development of this expected behaviour.
Proposal accepted, see you in Barcelona! In the meantime, could someone merge my KEP (kubernetes/enhancements#860)?
We ended up building a sidecar container that fronts the health endpoint. This, however, is called on the first boot of the container as well. The sidecar is written in Go, so it comes up really quickly, but still not quickly enough, because it fails the first check. I'm wondering what the purpose of an httpGet probe is here; other than having an init container install it, we can't rely on a given binary like curl being present on any container we build/use. Using curl, though, would allow me to have retries built into the check. I agree with what a lot of people have already said: this feels like logic that should be abstracted into Kubernetes. I don't necessarily know what the solution should be, but I think the extra property on the liveness probe would help.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
startupProbe is alpha in 1.16!
@matthyx thanks! From a brief look at the pull request, it looks a bit different from what was talked about here. Is there any documentation we can look over about how this startupProbe works? EDIT: Never mind, I think this is it here.
@geerlingguy we share the same case. A Laravel app (php-fpm) sometimes took up to 4 minutes to start.
@geerlingguy Looks like the new probe startupProbe will solve the issue here: 🎊 🎊 🎊

startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10
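For completeness: once the startup probe succeeds, the liveness and readiness probes take over (they are not run until then), so a combination like the following, using the same endpoint and port as above, gives the app up to 30 × 10 s = 5 minutes to start without risking a liveness-triggered restart.

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30   # up to 30 * 10s = 5 minutes to finish starting
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  periodSeconds: 10      # only runs after the startup probe has succeeded once
```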
Indeed!
As evidenced by kubernetes/kubernetes#27114 the naming leads one to think that a liveness probe executes after a readiness probe. This is not the case.
Our initial understanding was that the liveness probe would start checking after the readiness probe succeeded, but it turns out not to be like that.
We are testing with our system, which has a long boot time of approximately 1-3 minutes. We specify a readiness probe with the same URL as the liveness probe and set the initial delay of the liveness probe to 30 seconds; we found that the pod is killed by the failing liveness probe while the readiness probe is still failing.
So, why don't we set the initial delay of the liveness probe to more than 3 minutes? Well, there might be a case where the readiness probe succeeds in the first minute and the pod then fails to do its job before the liveness probe starts, so the running service would be affected.
Another point: why don't we specify only the readiness probe and leave out the liveness probe? When the readiness probe fails again, the pod is taken out of service as well, so the running service is not affected, but the pod won't be restarted and we would have to restart it manually.
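Roughly the configuration being described, with illustrative values: readiness and liveness share the same URL, the liveness probe starts after 30 seconds, and since the app can take 1-3 minutes to boot, the container is restarted before it ever becomes ready.

```yaml
# Values are illustrative; the path and port are assumptions.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # but the app may need 1-3 minutes to boot
  periodSeconds: 10
  failureThreshold: 3       # ~30s of failures later, the container is restarted
```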
There are some concerns we thought about; here is the list:
Maintenance: when the system enters this state, the liveness probe should stop working, and no number of failing readiness probes should be counted.
What do you think?