
Kubernetes is vulnerable to stale reads, violating critical pod safety guarantees #59848

Open
smarterclayton opened this issue Feb 14, 2018 · 89 comments
Labels
  kind/bug - Categorizes issue or PR as related to a bug.
  lifecycle/frozen - Indicates that an issue or PR should not be auto-closed due to staleness.
  priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
  sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
  sig/scalability - Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@smarterclayton (Contributor) commented Feb 14, 2018

When we added resourceVersion=0 to reflectors, we didn't properly reason about its impact on nodes. Its current behavior can cause two nodes to run a pod with the same name at the same time when multiple API servers are in use, which violates the cluster's pod safety guarantees. Because a read serviced by the watch cache can be arbitrarily delayed, a client that connects to that API server can read arbitrarily old history. Elsewhere we explicitly use quorum reads against etcd to prevent exactly this.

Scenario:

  1. T1: StatefulSet controller creates pod-0 (uid 1) which is scheduled to node-1
  2. T2: pod-0 is deleted as part of a rolling upgrade
  3. node-1 sees that pod-0 is deleted and cleans it up, then deletes the pod in the api
  4. The StatefulSet controller creates a second pod pod-0 (uid 2) which is assigned to node-2
  5. node-2 sees that pod-0 has been scheduled to it and starts pod-0
  6. The kubelet on node-1 crashes and restarts, then performs an initial list of pods scheduled to it against an API server in an HA setup (more than one API server) that is partitioned from the master (watch cache is arbitrarily delayed). The watch cache returns a list of pods from before T2
  7. node-1 fills its local cache with a list of pods from before T2
  8. node-1 starts pod-0 (uid 1) and node-2 is already running pod-0 (uid 2).

This violates pod safety. Since we support HA api servers, we cannot use resourceVersion=0 from reflectors on the node, and probably should not use it on the masters. We can only safely use resourceVersion=0 after we have retrieved at least one list, and only if we verify that resourceVersion is in the future.

@kubernetes/sig-apps-bugs @kubernetes/sig-api-machinery-bugs @kubernetes/sig-scalability-bugs This is a fairly serious issue: it can cause the cluster's identity guarantees to be lost, which means clustered software cannot run safely if it assumes the pod safety guarantee prevents two pods with the same name from running on the cluster at the same time. The user impact is likely loss of critical data.

This is also something that could happen for controllers - during a controller lease failover, the next leader could be working from a very old cache and undo recent work.

No matter what, the first list performed by a component starting from a clean state that must preserve "happens-before" has to be a live quorum read against etcd to fill its cache. That can only be done by omitting resourceVersion=0.
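As a minimal illustration with current client-go signatures (clientset and ctx are assumed to exist in the caller; this is a sketch, not the actual reflector code):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// initialPodList performs the initial fill without ResourceVersion: "0", so the
// request is served as a quorum read from etcd rather than from a possibly
// stale watch cache.
func initialPodList(ctx context.Context, clientset kubernetes.Interface) (*corev1.PodList, error) {
	// Setting ResourceVersion: "0" here would allow any apiserver's watch cache
	// to answer, which is exactly the stale-read window described above.
	return clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
}
```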

Fixes:

  1. Disable resourceVersion=0 from being used in reflector list, only use when known safe
  2. Disable resourceVersion=0 for first list, and optionally use resourceVersion=0 for subsequent calls if we know the previous resourceVersion is after our current version (assumes resource version monotonicity)
  3. Disable resourceVersion=0 for the first list from a reflector, then send resourceVersion=<last observed version> on subsequent list calls (which causes the watch cache to wait until that resource version shows up).
  4. Perform live reads from Kubelet on all new pods coming in to verify they still exist

1 is a pretty significant performance regression, but it is the most correct and safest option (just like when we enabled quorum reads everywhere). 2 is more complex, and a few people are trying to remove the monotonicity guarantees from resource versions, but it would retain most of the performance benefit of using this in the reflector. 3 is probably less complex than 2, but I'm not positive it actually works. 4 is hideous and won't fix other usages.
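For option 2, a hedged sketch of the freshness check it implies; newerOrEqualRV is a hypothetical helper, and it assumes resourceVersions can be compared numerically, the very monotonicity property noted above that some people want to drop:

```go
import "strconv"

// newerOrEqualRV reports whether a cache-served list is at least as fresh as
// the last resourceVersion this reflector synced to. resourceVersions are
// formally opaque strings, so this is illustrative only.
func newerOrEqualRV(listRV, lastSyncedRV string) bool {
	newRV, errNew := strconv.ParseUint(listRV, 10, 64)
	oldRV, errOld := strconv.ParseUint(lastSyncedRV, 10, 64)
	if errNew != nil || errOld != nil {
		// Unparseable: treat the list as stale and fall back to a quorum re-list.
		return false
	}
	return newRV >= oldRV
}
```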

Probably needs to be backported to 1.6.

@smarterclayton smarterclayton added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 14, 2018
@smarterclayton smarterclayton added this to the v1.10 milestone Feb 14, 2018
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. kind/bug Categorizes issue or PR as related to a bug. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Feb 14, 2018
@smarterclayton smarterclayton changed the title from "Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one master" to "Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one api server" Feb 14, 2018
@smarterclayton (Contributor, Author) commented Feb 14, 2018

Note that we added this to improve large cluster performance, so we are almost certainly going to regress to some degree (how much has been mitigated by other improvements in the last year is uncertain)

@roycaihw (Member)

/sub

@smarterclayton (Contributor, Author)

Disabling resourceVersion=0 for lists when more than one master is present is an option as well.

As Jordan noted, the optimization here is impactful because we avoid having to fetch many full pod lists from etcd when we only return a subset to each node. It's possible we could require all resourceVersion=0 calls to acquire a read lease on etcd, which bounds the delay but doesn't guarantee happens-before if the cache is delayed. If the watch returned a freshness guarantee, we could synchronize on that as well.

@smarterclayton (Contributor, Author)

We can do a synthetic write to the range and wait until it is observed by the cache, then service the rv=0. The logical equivalent is a serializable read on etcd for the range, but we need to know the highest observed RV on the range.

We can accomplish that by executing a range request with min create and mod revisions equal to the latest observed revision, and then performing the watch cache list at the highest RV on the range.
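In etcd clientv3 terms, a sketch of that range request under the assumptions above (latestObservedRev and the /registry/pods/ prefix are illustrative):

```go
import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// highestRevisionOnRange returns the newest mod revision seen on the pod range
// since latestObservedRev, without transferring full values. The watch cache
// could then wait until it has caught up to that revision before serving rv=0.
func highestRevisionOnRange(ctx context.Context, cli *clientv3.Client, latestObservedRev int64) (int64, error) {
	resp, err := cli.Get(ctx, "/registry/pods/",
		clientv3.WithPrefix(),
		clientv3.WithKeysOnly(),                   // keys only: cheap compared to a full list
		clientv3.WithMinModRev(latestObservedRev), // only keys touched at or after the observed revision
		clientv3.WithSerializable(),               // the serializable read over the range described above
	)
	if err != nil {
		return 0, err
	}
	highest := latestObservedRev
	for _, kv := range resp.Kvs {
		if kv.ModRevision > highest {
			highest = kv.ModRevision
		}
	}
	return highest, nil
}
```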

So:

  1. All reflectors need to preserve a “happens before” guarantee when fetching for general sanity - use of rv=0 breaks that
  2. rv=0 is a client visible optimization, but it’s not really necessary except for clients that can observe arbitrarily delayed history safely
  3. Not many clients can observe arbitrarily delayed history safely
  4. We should stop treating rv=0 specially

We can mitigate the performance impact by ensuring that an initial list from the watch cache preserves happens before. We can safely serve historical lists (rv=N) at any time. The watch cache already handles rv=N mostly correctly.

So the proposed change here is:

  1. Clients remove rv=0 and we stop honoring it
  2. Verify the watch cache correctly waits for rv=N queries
  3. We can serve a list or get from watch cache iff we perform a serializable read against etcd and retrieve the highest create/mod revisions
  4. We can make that query efficient by using min_create/mod_revision on the etcd list call
  5. Clients currently sending rv=0 where perf is critical (node) should start using the last observed resource version, which allows the watch cache to serve correctly.

That resolves this issue.

@kow3ns kow3ns added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 17, 2018
@smarterclayton (Contributor, Author)

Serializable read over the range, that is.

@smarterclayton (Contributor, Author)

Also:

  1. The next time someone adds a cache at any layer, we need to have a good review process that catches this. Probably a checklist we add into api review and an audit of any existing caches.

@jberkus commented Feb 21, 2018

If this MUST be resolved for 1.10, please add status/approved-for-milestone to it before Code Freeze. Thanks!

@jdumars (Member) commented Feb 23, 2018

Hi Clayton, could you go ahead and add "approved-for-milestone" label to this, as well as status (in progress)? That will help it stay in the milestone if this is a 1.10 blocker. Thanks!

@smarterclayton (Contributor, Author)

Caveat on the label - still trying to identify how we fix this - but no matter what, because this is a critical data integrity issue, we'll end up backporting this several releases. Will try to get a better estimate of the timeframe soon.

@kow3ns kow3ns added this to Backlog in Workloads Feb 26, 2018
@jberkus commented Feb 26, 2018

ooops, fixing labels.

@smarterclayton (Contributor, Author)

This can happen on a singleton master using an HA etcd cluster:

  1. apiserver is talking to member 0 in the cluster
  2. apiserver crashes and watch cache performs its initial list at RV=10
  3. kubelet deletes pod at RV=11
  4. a new pod is created at RV=12 on a different node
  5. kubelet crashes
  6. apiserver becomes partitioned from the etcd majority but is able to reestablish a watch from the partitioned member at RV=10
  7. kubelet contacts the apiserver and sees RV=10, launches the pod

@smarterclayton smarterclayton changed the title from "Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one api server" to "Kubernetes is vulnerable to stale reads for critical pod safety guarantees" Feb 28, 2018
@smarterclayton smarterclayton changed the title from "Kubernetes is vulnerable to stale reads for critical pod safety guarantees" to "Kubernetes is vulnerable to stale reads, violating critical pod safety guarantees" Feb 28, 2018
@wojtek-t (Member) commented Mar 1, 2018

So the proposed change here is:

  1. Clients remove rv=0 and we stop honoring it
  2. Verify the watch cache correctly waits for rv=N queries
  3. We can serve a list or get from watch cache iff we perform a serializable read against etcd and retrieve the highest create/mod revisions
  4. We can make that query efficient by using min_create/mod_revision on the etcd list call
  5. Clients currently sending rv=0 where perf is critical (node) should start using the last observed resource version, which allows the watch cache to serve correctly.

Why do we need (3) and (4)?
If we verify that rv=N works fine (and it should), why can't we just start setting rv=<last observed RV> instead?
If we do that, there are two possible outcomes:

  1. that RV is already in the watch cache, and then we can safely serve from the watch cache
  2. that RV is not yet in the watch cache (the watch cache is delayed), and then we will wait for some time; if it doesn't become fresh enough within some threshold we will reject with an error IIRC.

In the second case, the reflector may retry without setting any RV to get a consistent read from etcd (that should be pretty rare, so it should be fine from a performance point of view).
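A hedged sketch of that retry path with current client-go signatures (lastKnownRV and clientset come from the caller; the exact error a too-delayed cache returns may differ):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// relistAtLeastAsFreshAs lists pods no older than lastKnownRV. If the watch
// cache cannot become fresh enough in time, it falls back to a quorum read
// from etcd by leaving ResourceVersion unset.
func relistAtLeastAsFreshAs(ctx context.Context, clientset kubernetes.Interface, lastKnownRV string) (*corev1.PodList, error) {
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		ResourceVersion: lastKnownRV, // served once the cache has caught up to this RV
	})
	if err != nil && (apierrors.IsTimeout(err) || apierrors.IsResourceExpired(err)) {
		// Rare case: the cache stayed too far behind; pay for one consistent read.
		return clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	}
	return pods, err
}
```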

@smarterclayton What am I missing?

@wojtek-t (Member) commented Mar 1, 2018

@kubernetes/sig-scalability-misc

@jdumars (Member) commented Mar 1, 2018

Please stay in close contact with the release team on this.

@julienlau

Hi,

Maybe it would be a good idea to have a set of options to completely disable the watch cache?
As I understand it, today we can disable the watch cache in the API server with --watch-cache=false, but the local cache cannot be disabled in controllers.

These options would be helpful for testing purposes, but would also open the way more broadly to etcd shims such as Kine or FoundationDB.

@embano1 (Member) commented Oct 15, 2021

As I understand it, today we can disable the watch cache in the API server with --watch-cache=false, but the local cache cannot be disabled in controllers.

What do you mean by local cache? The cache.Store interface? What would be the benefit or use case of not using a local cache?

@julienlau

I am not very familiar with exactly how many caching layers there are between etcd and the final client, but the idea would be to deactivate all of them except the one in the final client (without converting Watch into Get semantics).
As I understand it, the final clients that currently observe stale reads are controllers, and the source of truth is etcd.
The API server is certainly one layer of cache, but I have the feeling that, because of shared caches / local caches, there can also be several layers of cache after the API server.

@embano1 (Member) commented Oct 19, 2021

@julienlau you might want to read the details shared in this project, explaining the time travel issues in detail and how to detect/prevent them: https://github.com/sieve-project/sieve

@smarterclayton (Contributor, Author) commented Nov 2, 2021

Just setting a status: this is still fundamentally broken in Kube, and the use of the watch cache is unsafe with HA apiservers in a way that allows two kubelets to run a pod with the same name at the same time, which violates assumptions that StatefulSets and PVCs depend on. Any cache that is HA must establish that it has the freshest data (data from time T > the time the request was made) to preserve the guarantees that Kube controllers implicitly rely on.

I have not seen a solution proposed other than requiring the watch cache to verify it is up to date before serving a LIST @ rv=0, but we do need to address this soon.

I'm open to arguments that the pod exclusion property is not fundamental but I haven't seen any credible ones in the last five years :)

@jberkus commented Nov 3, 2021

Welcome to the CAP theorem, I guess?

@julienlau commented Nov 4, 2021

Maybe defining specific test cases to work this out would help?
It seems to me that several other issues linked to this one use some sort of workaround to get things working, but they could use a proper way to handle this at the controller level.
Sieve seems to be designed to debug controllers, but the same failure scenarios could also be exercised at the apiserver level to test the apiserver watch cache and load-balancing behavior?
Maybe a read-through / write-through debug interface to the apiserver would also help (it may be what you are talking about when using LIST @ rv=0)?

@wojtek-t (Member)

I have wanted to post an update on this for quite some time and am finally getting to it (motivated by a need to link it).

With #83520 (so since 1.18), this problem has much lower exposure.

That particular PR ensures that when a reflector has to relist (e.g. because the watch fell out of history, or for whatever other reason), the relist will NOT go back in time: we either list from the cache by providing the currently known RV, which ensures that our LIST will be at least that fresh, or we list from etcd using quorum reads.

This means that, within a single incarnation of a component, it will never go back in time.

The remaining (and unsolved) case is what happens on component start. It still lists from the cache then (without any preconditions), which means it can go back in time compared to where it was before it was restarted.

But within a single incarnation of a component - we're now safe.

@julienlau

Update appreciated.
OK on RV usage in general, but regarding the corner case you mentioned, "component start":
in your view, do Kubernetes updates also meet these criteria?

@redbaron (Contributor) commented Jan 27, 2022

It still lists from the cache then (without any preconditions), which means it can go back in time compared to where it was before it was restarted.

A naive solution, which is likely not applicable, but I'll ask nevertheless: is listing from etcd with a quorum read an option here?

@wojtek-t (Member)

A naive solution, which is likely not applicable, but I'll ask nevertheless: is listing from etcd with a quorum read an option here?

This breaks scalability (especially badly for some resources, pods in particular), because you can't list "pods from my node" from etcd - you can only list all pods, deserialize them in kube-apiserver, and filter, for each node independently. If you have a 5k-node cluster with 150k pods that had an incident and all of the kubelets are now starting at the same time ...
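For context, the node-scoped list looks roughly like the sketch below; the spec.nodeName field selector is evaluated in kube-apiserver (or its watch cache), not in etcd, so serving it as a quorum read means every such list pulls the full pod keyspace out of etcd:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// listPodsOnNode is a sketch of the kubelet-style list: filtering by
// spec.nodeName happens in the apiserver, never in etcd.
func listPodsOnNode(ctx context.Context, clientset kubernetes.Interface, nodeName string) (*corev1.PodList, error) {
	return clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("spec.nodeName", nodeName).String(),
	})
}
```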

@embano1 (Member) commented Jan 27, 2022

With https://github.com/sieve-project/sieve we found several bugs in real-world controllers, some of them caused by time-travel anomalies.

While not a perfect mitigation (because it always depends on the problem domain), using optimistic concurrency control patterns not only during updates but also during (conditional) deletes can mitigate some of the time-travel anomalies by fencing off operations from stale views. Again, not perfect, but a good practice that does not sacrifice scalability.
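As one concrete (hypothetical) example of that pattern in client-go, a precondition-guarded delete only succeeds if the object still has the UID and resourceVersion the controller last observed:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// deleteIfUnchanged refuses to delete a pod that has been replaced or updated
// since the controller last saw it, fencing the operation off from stale views.
func deleteIfUnchanged(ctx context.Context, clientset kubernetes.Interface, ns, name string, uid types.UID, rv string) error {
	return clientset.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		Preconditions: &metav1.Preconditions{
			UID:             &uid, // fails with a conflict if a new pod now owns this name
			ResourceVersion: &rv,  // fails if the pod changed since we read it
		},
	})
}
```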

On smaller clusters, disabling the API server cache might be a safe alternative, too. This eliminates the aforementioned (re)start time travel anomaly in controllers.

Thoughts @wojtek-t ?

@wojtek-t (Member)

Sure - those are all good, but they are only partial mitigations, not a full solution. And what we're trying to do is provide a full solution.
FWIW - kubernetes/enhancements#3142 is one of the things we're looking at.
