KEP for Graduating CronJob to GA #978
Conversation
barney-s commented on Apr 20, 2019
- KEP for graduating CronJob to GA
Hi @barney-s. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: barney-s. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
@barney-s please see kubernetes/kubernetes#82659 for the list of things that need to be fixed.
And subsequent comments
Force-pushed from e1b5dca to ea1088a
The trouble is, someone with just API access has no idea what that TZ is. Sometimes cluster operators don't either, when they run the controller manager in a container with a UTC timezone on a host with a non-UTC timezone. The main part of my concern is people with just API access who want to predict when their job will run. Making the API stable is a promise that the snags are sorted out (or sorted enough), so I want to make sure people who are deciding on this can have regard to this particular snag. It's something Kubernetes users often seem to run into. I'd like to address that challenge before GA, by adding a `.spec.timezone` field and then, for the initial implementation, constraining the values allowed for that field.
I can make a separate PR against the KEP if other people feel this is an improvement worth doing.
I see 2 issues.
For 1, can it be fixed by asking the administrator for the timezone of the master component? For 2, TBH I don't know if we have a conformance recommendation for the timezone of a master. This causes the same set of cronjobs to behave differently across clusters from different providers. I am thinking a per-CronJob timezone may be overkill. Instead it should be a cluster-level config or a recommendation. That being said, I will reach out to some more folks. @liggitt - would you have any thoughts on this?
Was asked to comment from a PR over at the main repo. After reading the KEP, my impression is that this initiative is mostly geared towards scaling, performance and monitoring, which is of course welcome. However, I'm not seeing initiatives that fix the "100 missed start times" limit. Therefore I would suggest that the aforementioned PR is added to the "Fix applicable open issues" list, or at least that some attention is given to it. It's the only issue I've had with Kubernetes CronJobs so far.
The question I'd ask is: once this is declared stable, does that make it harder to remove that requirement for an out-of-band conversation?
@sftim please recheck. I have added a timezone field.
@barney-s thanks, that's addressed my concerns. It means that someone who wants to can write an admission controller to default to UTC or to reject CronJobs that don't specify a whitelisted timezone.
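A minimal sketch of that admission idea, assuming the proposed timezone value arrives as a plain string; the whitelist, the function name `admitTimezone`, and the default-to-UTC behaviour are illustrative assumptions, not part of the KEP or any existing API:

```go
package main

import (
	"fmt"
	"time"
)

// allowedTimezones is a hypothetical operator-chosen whitelist.
var allowedTimezones = map[string]bool{
	"UTC":              true,
	"America/New_York": true,
}

// admitTimezone sketches the admission logic: default an empty timezone to
// UTC, reject values outside the whitelist or unknown to the tzdata database.
func admitTimezone(tz string) (string, error) {
	if tz == "" {
		return "UTC", nil // defaulting behaviour
	}
	if !allowedTimezones[tz] {
		return "", fmt.Errorf("timezone %q is not in the allowed list", tz)
	}
	if _, err := time.LoadLocation(tz); err != nil {
		return "", fmt.Errorf("unknown timezone %q: %v", tz, err)
	}
	return tz, nil
}

func main() {
	for _, tz := range []string{"", "UTC", "Europe/Berlin"} {
		result, err := admitTimezone(tz)
		fmt.Printf("input=%q result=%q err=%v\n", tz, result, err)
	}
}
```

The same check could run in a validating webhook or a mutating webhook depending on whether the operator prefers rejection or defaulting.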
> ##### Multiple workers
>
> We also propose to have multiple workers, controlled by a flag similar to the [statefulset controller](https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-controller-manager/app/apps.go#L65). The default would be set to 5, similar to [statefulset](https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/statefulset/config/v1alpha1/defaults.go#L34).
>
> ##### Handling Cron aspect
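A minimal sketch of the multiple-workers idea in the quoted text. The flag name `concurrent-cronjob-syncs` and the plain channel standing in for a real workqueue are illustrative assumptions, not the KEP's final design; the default of 5 comes from the quoted proposal:

```go
package main

import (
	"flag"
	"fmt"
	"sync"
)

// Hypothetical flag modeled on the statefulset controller's
// --concurrent-statefulset-syncs; the name is an assumption.
var concurrentCronJobSyncs = flag.Int("concurrent-cronjob-syncs", 5,
	"number of CronJob objects that may be synced concurrently")

// syncCronJob stands in for the controller's per-object reconcile function.
func syncCronJob(key string) {
	fmt.Println("syncing", key)
}

func main() {
	flag.Parse()

	keys := make(chan string)
	var wg sync.WaitGroup

	// Start one worker goroutine per allowed concurrent sync, all pulling
	// from the same work channel (a real controller would use a workqueue).
	for i := 0; i < *concurrentCronJobSyncs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range keys {
				syncCronJob(key)
			}
		}()
	}

	// Feed a few example object keys, then shut the workers down.
	for _, k := range []string{"default/backup", "default/report", "default/cleanup"} {
		keys <- k
	}
	close(keys)
	wg.Wait()
}
```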
There is a critical problem with the current implementation that I would like to call out.

When determining if a cron should be scheduled, `getMissedSchedules()` is used to return the most recent missed schedule time (and a count of missed start times, in the form of extra items in the returned slice). https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/utils.go#L92

This code uses a 3rd-party module that has a `Next()` function for schedules, but not a `Previous()`. Rather than looking back with `now.Previous()`, the code repeatedly calls `Next()` until reaching the current time... or hitting a hard cap. I have explored 3 ways to remove the hard cap, ranked in decreasing order of appeal as I understand them:

- Implement a `Previous()` function, by working around the 3rd-party library, or by outright replacing it.
- Use a binary search with the `Next()` function, to avoid iterating over every start time in the checked time window. For any subwindow, if there is a start time in the second half of the window, the most recent missed start time is at least at that time. (A sketch follows after this comment.)
- Remove the concept of "which start time" the job is launched on behalf of. If we only need to know that the CronJob is due for a run (and is within its startingDeadlineSeconds), we can avoid this pattern. This seems the least ideal due to the decrease in clear status.
We are implementing option 2 at Lyft (as it's far easier to test and have fast confidence in than option 1). 100 missed start times is nowhere near long enough to support our production Cron users, given the myriad of ways that code or the CronJob/Kubernetes machinery can break.
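A minimal sketch of option 2 (binary search using only `Next()`), assuming the robfig/cron `Schedule` interface; the function name, the one-second cutoff, and the minute-granularity assumption are illustrative choices, not the actual patch:

```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

// mostRecentScheduleTime finds the latest schedule time in (earliest, now]
// using only Schedule.Next, via binary search over the time window.
// It returns the zero time if nothing was scheduled in the window.
func mostRecentScheduleTime(sched cron.Schedule, earliest, now time.Time) time.Time {
	best := sched.Next(earliest)
	if best.After(now) {
		return time.Time{} // no start time was missed in the window
	}
	lo, hi := best, now
	// Invariant: best is a schedule time <= now, and the latest schedule
	// time <= now lies in [best, hi]. The window roughly halves each pass.
	for hi.Sub(lo) > time.Second {
		mid := lo.Add(hi.Sub(lo) / 2)
		if next := sched.Next(mid); !next.After(now) {
			// A start time exists in (mid, now]; it is a better candidate.
			best, lo = next, next
		} else {
			// No start time in (mid, now]; the answer is at or before mid.
			hi = mid
		}
	}
	// With standard (minute-granularity) cron specs, a <=1s window cannot
	// hide another start time, so best is the most recent missed one.
	return best
}

func main() {
	sched, err := cron.ParseStandard("*/10 * * * *") // every 10 minutes
	if err != nil {
		panic(err)
	}
	lastSchedule := time.Now().Add(-24 * time.Hour) // pretend the controller was down for a day
	fmt.Println("most recent missed start:", mostRecentScheduleTime(sched, lastSchedule, time.Now()))
}
```

Because the window halves on every iteration, even a multi-day outage costs only a few dozen `Next()` calls instead of thousands, which is what removes the need for the hard cap.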
@vllry - Thanks for this feedback. Are you using a custom version of CronJob or a separate controller? Is that public?
We are interested in the myriad ways the machinery can break.
We are patching the upstream controller (I can open an open source PR soon if you're interested, but it's not yet deployed/proven internally). We considered writing a new controller but are hoping this effort moves forward soon.
We have ~250 CronJobs at present, which entails (a) hitting a lot of infrequent edge cases, and (b) being extremely susceptible to the performance issues outlined in the KEP and my specific comment.
> `nextScheduleTime += jitter`
>
> ### Support Timezone for cronjobs
Suggestion: this may be better suited for a v2. This feels easy to get wrong, as a number of other commenters have suggested. It would be unfortunate to rush a design to v1 while trying to promote the other improvements.
> It would be unfortunate to rush a design to v1 while trying to promote the other improvements.

If rushing is a concern, I'd prefer to extend the beta: add `.spec.timezone` along with other tidying, and try that out for a release cycle (or more).
I do see a use case for this, especially with multi-cluster scenarios. Having a schedule with a local timezone or UTC (depending on the master VM/pod) would be good for most use cases. Having a deterministic schedule may be preferable for others.
But I do agree we need to get some consensus on the priority and when it needs to be done.
I would suggest moving it to a separate KEP. I agree it's potentially controversial, while the overall idea behind this KEP isn't. I don't want to have this KEP stuck for too long (it's already hanging too long).
LGTM from the scalability perspective.
> - CronJob was introduced in Kubernetes 1.3 as ScheduledJobs
> - In Kubernetes 1.8 it was renamed to CronJob and promoted to Beta
>
> ## Alternatives and Further Reading
Putting on my "production readiness review" hat: you will also need to fill in the PRR questionnaire: #1620
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Still needs work to chase up loose ends.
@barney-s are you still working on this one? @alaypatel07 and I are currently working on updating the controller; I'd like to take over this KEP, update it, and finish it, if you don't mind.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
/assign
Closing in favor of #1996