Skip to content

Add kubelet plugin manager #73891

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 31, 2019
Merged

Conversation

taragu
Copy link
Contributor

@taragu taragu commented Feb 10, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
Implement a controller that manages plugin registration/unregistration

Which issue(s) this PR fixes:
Fixes #73371

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Kubelet plugin registration now has retry and exponential backoff logic for when registration of plugins (like CSI or device plugin) fail.

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 10, 2019
@taragu
Copy link
Contributor Author

taragu commented Feb 10, 2019

* "reconciler" should use the [`nestedpendingoperations` library]
to execute registration and unregistration operations. 
This library automatically takes care of exponential backoff and ensures no more
then one operation (registration or unregistration) can happen at a given time for 
a given socket.

@saad-ali I was looking at nestedpendingoperations but wasn't sure how to use it to register/unregister plugin. Could you please point me to a similar example?

@taragu
Copy link
Contributor Author

taragu commented Feb 10, 2019

cc @vikaschoudhary16

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 18, 2019
@saad-ali
Copy link
Member

@taragu there is a generic version of the nestedpendingoperation that is simpler and not volume specific called goroutinemap. See https://github.com/kubernetes/kubernetes/blob/master/pkg/util/goroutinemap/goroutinemap.go (edited)

The operation_executor is just a wrapper around nestedpendingoperation (or goroutinemap). I would define a new operation_executor for the plugin manager that defines two operations: RegisterPlugin() and UnregisterPlugin(). This operation_executor would use goroutinemap to ensure that no more then one operation starts for the same socket. So the key that you use in the goroutinemap (I think) would be the socket path.

Once you have an operation_executor defined, you can create two caches: desired and actual that keep track of the sockets that we want to register and the sockets that we actually have successfully registered.

Then you can create the dumb diff loop, the reconciler, it will (maybe every 100 ms) diff desired and actual. If a socket is in desired but not in actual it will call operation_executor.Register(). If a socket is in actual but not desired it will call operation_executor.Unregister(). And because operation_executor uses goroutinemap it will refuse to start a new register or unregister operation while one is pending (and automatically handle exponential backoff).

Once you have that, you need to write a populator -- in this case that will basically just be the existing code that currently calls the old Register method -- and instead it would add to the desired state cache on register, and remove from desired state cache on unregister (populator in this case doesn't need to be a new package).

@taragu
Copy link
Contributor Author

taragu commented Feb 20, 2019

@saad-ali thank you so much!! This is very comprehensive. I'll try to digest this later this evening.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 25, 2019
@taragu taragu force-pushed the plugin-manager branch 7 times, most recently from 17aa31b to a6f39d6 Compare March 3, 2019 22:34
@taragu taragu force-pushed the plugin-manager branch 3 times, most recently from 1853dcf to dc863d8 Compare March 9, 2019 14:10
Copy link
Member

@saad-ali saad-ali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taragu, one more thing I forgot, we should undo the temporary fix from #72873 as well. We could do that in a follow up PR if you want.

@taragu
Copy link
Contributor Author

taragu commented May 10, 2019

@saad-ali sounds good. I will undo #72873 in a followup PR

@msau42
Copy link
Member

msau42 commented May 13, 2019

/hold
Spoke with @vishh. There are still some discussions happening with @saad-ali around if we can just have the drivers restart

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 13, 2019
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 30, 2019
@vishh
Copy link
Contributor

vishh commented May 30, 2019

/approve
(after offline discussion with @saad-ali who patiently explained that k'let can silently fail re-registration when the plugins assume normal operation)

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: saad-ali, taragu, vishh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 30, 2019
@saad-ali
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 30, 2019
@saad-ali
Copy link
Member

/milestone v1.15

@k8s-ci-robot k8s-ci-robot added this to the v1.15 milestone May 30, 2019
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 30, 2019
@saad-ali
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 30, 2019
@taragu
Copy link
Contributor Author

taragu commented May 31, 2019

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 31, 2019
@taragu
Copy link
Contributor Author

taragu commented May 31, 2019

/test pull-kubernetes-integration

1 similar comment
@taragu
Copy link
Contributor Author

taragu commented May 31, 2019

/test pull-kubernetes-integration

@k8s-ci-robot k8s-ci-robot merged commit fe37733 into kubernetes:master May 31, 2019
asw.Lock()
defer asw.Unlock()

if _, ok := asw.socketFileToInfo[socketPath]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why the check is needed - we can delete the entry directly.
It is fine if the key doesn't exist.

dsw.Lock()
defer dsw.Unlock()

if _, ok := dsw.socketFileToInfo[socketPath]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to check prior to deletion

return nil
}

// Dial establishes the gRPC communication with the picked up plugin socket. https://godoc.org/google.golang.org/grpc#Dial
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be lowercase: dial

}

func (rc *reconciler) getHandlers() map[string]cache.PluginHandler {
rc.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting read lock should suffice

actualStateOfWorldUpdater ActualStateOfWorldUpdater) func() error {

unregisterPluginFunc := func() error {
client, conn, err := dial(socketPath, dialTimeoutDuration)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the plugin has already exited, the dial here will fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client is needed to retrieve from pluginHandlers using infoResp.Type

Do you have suggestion on how to get plugin handler in that case ?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The client is needed to retrieve from pluginHandlers using infoResp.Type

Do you have suggestion on how to get plugin handler in that case ?

The csi plugin will delete the sock file when it exits. When the operationExecutor.UnregisterPlugin is called, the sock file is not existed.
Can add a sock file to the plugin mapping table. ≧◠◡◠≦ Just like 1.14:


func (w *Watcher) getPlugin(socketPath string) (pathInfo, bool) {
	w.mutex.Lock()
	defer w.mutex.Unlock()

	plugin, ok := w.plugins[socketPath]
	return plugin, ok
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my POV, we can save socketPath: infoResp.Type in actualStateOfWorld on RegisterPlugin phase, and then it can be used on UnregisterPlugin phase

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made similar suggestion over #87282

Let's see what other people think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retries for kubelet plugin registration should be at a lower layer
8 participants