Closed
Description
We're currently sending whole NodeStatuses only to update a Node heartbeat. This is a big source of traffic in the cluster, and one of the possible causes of 1000-node cluster failures. We need to extract the heartbeat into a new API 'Heartbeat' object with a timestamp and an object reference only, and make Kubelet/NodeController use this object instead of NodeStatus to determine the health of the Node.
cc: @wojtek-t @lavalamp @bgrant0607 @brendandburns @smarterclayton @timothysc @davidopp @fgrzadkowski
Activity
wojtek-t commented on Sep 29, 2015
cc @dchen1107
smarterclayton commented on Sep 29, 2015
Why not just add a heartbeat subresource to the node that performs the operation?
EDIT: to do the update, but continue to just retrieve nodes as is. I don't see the need to retrieve a Heartbeat, just to set it. We don't have a separate pod binding object, we just have the "bind" verb on the pod that mutates nodeName
gmarek commented on Sep 29, 2015
That's the plan for the implementation, but we still need to store it somewhere, hence the need for an API object.
smarterclayton commented on Sep 29, 2015
Why wouldn't the store continue to be on the node? Virtual resources like scale / bind / etc already handle this. Maybe that's what you're suggesting, just want to be sure.
smarterclayton commented on Sep 29, 2015
I.e.
PUT /nodes/foo/heartbeat
simply sets lastProbeTime and accepts the "Heartbeat" virtual subresource you described.
wojtek-t commented on Sep 29, 2015
But this would require changing how subresources are implemented. The problem is that currently we just send the whole object in that case (e.g. the whole pod in the case of binding). We would need an efficient PATCH operation that sends only the data that is changing (not the whole object).
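To make the size concern concrete, here is a rough sketch that compares the JSON size of a full status-like object with a heartbeat-only body. `NodeStatusLike`, `Condition`, `Heartbeat`, and all field names and sample values are hypothetical stand-ins invented for illustration, not the real Kubernetes API types, and real NodeStatus objects carry considerably more data than this sample.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Condition is a hypothetical, trimmed-down node condition.
type Condition struct {
	Type              string `json:"type"`
	Status            string `json:"status"`
	LastHeartbeatTime string `json:"lastHeartbeatTime"`
}

// NodeStatusLike is a hypothetical stand-in for a full node status.
type NodeStatusLike struct {
	Capacity   map[string]string `json:"capacity"`
	Conditions []Condition       `json:"conditions"`
	Addresses  []string          `json:"addresses"`
}

// Heartbeat is a hypothetical minimal body: an object reference and a timestamp.
type Heartbeat struct {
	NodeName      string `json:"nodeName"`
	LastProbeTime string `json:"lastProbeTime"`
}

// sampleStatus returns made-up data shaped like a small node status.
func sampleStatus() NodeStatusLike {
	const ts = "2015-09-29T00:00:00Z"
	return NodeStatusLike{
		Capacity: map[string]string{"cpu": "4", "memory": "16Gi", "pods": "40"},
		Conditions: []Condition{
			{Type: "Ready", Status: "True", LastHeartbeatTime: ts},
			{Type: "OutOfDisk", Status: "False", LastHeartbeatTime: ts},
		},
		Addresses: []string{"10.0.0.12", "node-foo.example.internal"},
	}
}

// sampleHeartbeat returns the heartbeat-only body for the same made-up node.
func sampleHeartbeat() Heartbeat {
	return Heartbeat{NodeName: "foo", LastProbeTime: "2015-09-29T00:00:00Z"}
}

// encodedSize returns the number of bytes in the JSON encoding of v.
func encodedSize(v interface{}) int {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return len(b)
}

func main() {
	fmt.Println("full status bytes:", encodedSize(sampleStatus()))
	fmt.Println("heartbeat bytes:  ", encodedSize(sampleHeartbeat()))
}
```

The absolute numbers mean little with toy data; the point is only that the heartbeat body stays constant in size while the full object grows with every condition, address, and capacity entry.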
smarterclayton commented on Sep 29, 2015
Are you looking to reduce UPDATE cost, or WATCH of updated node cost?
Binding is a minimal resource - it's just the object reference for the target node and the name of the pod. For the former, you would POST/PUT heartbeat "lastProbeTime" to the apiserver, which would run a guaranteed update on node that only alters the lastProbeTime. The kubelets would only send a minimal object, but anyone watching would still get the update.
For the latter, if the resources are truly split, wouldn't that require the node controller to fetch and watch two objects in order to stitch that data together? I didn't think watch on node was the problem, which is why I assume you meant the former.
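A minimal sketch of the "former" option, assuming an in-memory store in place of etcd: the caller submits only a small heartbeat body, and the server-side retry loop applies it to the stored node so that nothing but lastProbeTime changes. The `store`, `Node`, `Heartbeat`, and `putHeartbeat` names, and the integer resource version, are hypothetical and only illustrate the compare-and-swap pattern; this is not the apiserver's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Node is a hypothetical stored object with only the fields needed here.
type Node struct {
	Name            string
	Capacity        string
	LastProbeTime   int64 // Unix seconds
	ResourceVersion int
}

// Heartbeat is the hypothetical minimal body a kubelet would send.
type Heartbeat struct {
	LastProbeTime int64
}

var errConflict = errors.New("resource version conflict")

// store is an in-memory stand-in for etcd with compare-and-swap writes.
type store struct {
	mu    sync.Mutex
	nodes map[string]Node
}

func newStore(nodes ...Node) *store {
	s := &store{nodes: make(map[string]Node)}
	for _, n := range nodes {
		s.nodes[n.Name] = n
	}
	return s
}

func (s *store) get(name string) (Node, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	n, ok := s.nodes[name]
	return n, ok
}

// compareAndSwap writes n only if the stored version still equals expected.
func (s *store) compareAndSwap(n Node, expected int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, ok := s.nodes[n.Name]
	if !ok {
		return fmt.Errorf("node %q not found", n.Name)
	}
	if cur.ResourceVersion != expected {
		return errConflict
	}
	n.ResourceVersion = expected + 1
	s.nodes[n.Name] = n
	return nil
}

// guaranteedUpdate re-reads the node and retries until mutate is applied
// without a version conflict, or a non-conflict error occurs.
func (s *store) guaranteedUpdate(name string, mutate func(*Node)) error {
	for {
		cur, ok := s.get(name)
		if !ok {
			return fmt.Errorf("node %q not found", name)
		}
		updated := cur // Node has no reference fields, so this is a full copy
		mutate(&updated)
		err := s.compareAndSwap(updated, cur.ResourceVersion)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
	}
}

// putHeartbeat models PUT /nodes/<name>/heartbeat: only LastProbeTime changes.
func (s *store) putHeartbeat(name string, hb Heartbeat) error {
	return s.guaranteedUpdate(name, func(n *Node) {
		n.LastProbeTime = hb.LastProbeTime
	})
}

func main() {
	s := newStore(Node{Name: "foo", Capacity: "4 CPU"})
	if err := s.putHeartbeat("foo", Heartbeat{LastProbeTime: 1443484800}); err != nil {
		fmt.Println("heartbeat failed:", err)
		return
	}
	n, _ := s.get("foo")
	fmt.Printf("%+v\n", n)
}
```

In this model the write from the kubelet is small, but the stored node still changes on every heartbeat, so anyone watching nodes would still receive an update, as noted above.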
wojtek-t commented on Sep 29, 2015
Yes - I mostly meant the former.
For the watch - with the new watch in apiserver - we will read the data from etcd only once and then send it only to interested watchers (and there should be a constant number of those).
So yes - I'm mostly worried about the write path (not the read path).
smarterclayton commented on Sep 29, 2015
Binding is modeled today as:
The result of that is persisted into the pod in etcd via the GuaranteedUpdate loop.
Could be very similar.