
Kubernetes is vulnerable to stale reads, violating critical pod safety guarantees #59848

Open
smarterclayton opened this issue Feb 14, 2018 · 89 comments
Labels
  kind/bug - Categorizes issue or PR as related to a bug.
  lifecycle/frozen - Indicates that an issue or PR should not be auto-closed due to staleness.
  priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
  sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
  sig/scalability - Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@smarterclayton (Contributor) commented Feb 14, 2018

When we added resourceVersion=0 to reflectors, we didn't properly reason about its impact on nodes. Its current behavior can cause two nodes to run a pod with the same name at the same time when multiple API servers are in use, which violates the cluster's pod safety guarantees. Because a read serviced by the watch cache can be arbitrarily delayed, a client that connects to that API server can read arbitrarily old history. Elsewhere we explicitly use quorum reads against etcd to prevent exactly this.

Scenario:

  1. T1: StatefulSet controller creates pod-0 (uid 1) which is scheduled to node-1
  2. T2: pod-0 is deleted as part of a rolling upgrade
  3. node-1 sees that pod-0 is deleted and cleans it up, then deletes the pod in the api
  4. The StatefulSet controller creates a second pod pod-0 (uid 2) which is assigned to node-2
  5. node-2 sees that pod-0 has been scheduled to it and starts pod-0
  6. The kubelet on node-1 crashes and restarts, then performs an initial list of pods scheduled to it against an API server in an HA setup (more than one API server) that is partitioned from the master (watch cache is arbitrarily delayed). The watch cache returns a list of pods from before T2
  7. node-1 fills its local cache with a list of pods from before T2
  8. node-1 starts pod-0 (uid 1) and node-2 is already running pod-0 (uid 2).

This violates pod safety. Since we support HA api servers, we cannot use resourceVersion=0 from reflectors on the node, and probably should not use it on the masters. We can only safely use resourceVersion=0 after we have retrieved at least one list, and only if we verify that resourceVersion is in the future.

@kubernetes/sig-apps-bugs @kubernetes/sig-api-machinery-bugs @kubernetes/sig-scalability-bugs This is a fairly serious issue: it can cause the cluster's identity guarantees to be lost, which means clustered software cannot run safely if it assumes the pod safety guarantee prevents two pods with the same name from running on the cluster at the same time. The user impact is likely loss of critical data.

This is also something that could happen for controllers - during a controller lease failover, the next leader could be working from a very old cache and undo recent work.

No matter what, the first list performed by a component starting from a clean state that must preserve "happens-before" has to be a live quorum read against etcd to fill its cache. That can only be done by omitting resourceVersion=0.
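As a minimal illustration with current client-go signatures (clientset and ctx are assumed to exist in the caller; this is a sketch, not the actual reflector code):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// initialPodList performs the initial fill without ResourceVersion: "0", so the
// request is served as a quorum read from etcd rather than from a possibly
// stale watch cache.
func initialPodList(ctx context.Context, clientset kubernetes.Interface) (*corev1.PodList, error) {
	// Setting ResourceVersion: "0" here would allow any apiserver's watch cache
	// to answer, which is exactly the stale-read window described above.
	return clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
}
```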

Fixes:

  1. Disable resourceVersion=0 from being used in reflector list, only use when known safe
  2. Disable resourceVersion=0 for first list, and optionally use resourceVersion=0 for subsequent calls if we know the previous resourceVersion is after our current version (assumes resource version monotonicity)
  3. Disable resourceVersion=0 for the first list from a reflector, then send resourceVersion=<last observed version> on subsequent list calls (which causes the watch cache to wait until that resource version shows up).
  4. Perform live reads from Kubelet on all new pods coming in to verify they still exist

1 is a pretty significant performance regression, but it is the most correct and safest option (just like when we enabled quorum reads everywhere). 2 is more complex, and a few people are trying to remove the monotonicity guarantees from resource versions, but it would retain most of the performance benefit of using this in the reflector. 3 is probably less complex than 2, but I'm not positive it actually works. 4 is hideous and won't fix other usages.
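For option 2, a hedged sketch of the freshness check it implies; newerOrEqualRV is a hypothetical helper, and it assumes resourceVersions can be compared numerically, the very monotonicity property noted above that some people want to drop:

```go
import "strconv"

// newerOrEqualRV reports whether a cache-served list is at least as fresh as
// the last resourceVersion this reflector synced to. resourceVersions are
// formally opaque strings, so this is illustrative only.
func newerOrEqualRV(listRV, lastSyncedRV string) bool {
	newRV, errNew := strconv.ParseUint(listRV, 10, 64)
	oldRV, errOld := strconv.ParseUint(lastSyncedRV, 10, 64)
	if errNew != nil || errOld != nil {
		// Unparseable: treat the list as stale and fall back to a quorum re-list.
		return false
	}
	return newRV >= oldRV
}
```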

Probably needs to be backported to 1.6.

@smarterclayton smarterclayton added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Feb 14, 2018
@smarterclayton smarterclayton added this to the v1.10 milestone Feb 14, 2018
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. kind/bug Categorizes issue or PR as related to a bug. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Feb 14, 2018
@smarterclayton smarterclayton changed the title from "Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one master" to "Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one api server" Feb 14, 2018
@smarterclayton (Contributor, Author) commented Feb 14, 2018

Note that we added this to improve large cluster performance, so we are almost certainly going to regress to some degree (how much has been mitigated by other improvements in the last year is uncertain)

@roycaihw (Member)

/sub

@smarterclayton (Contributor, Author)

Disabling resourceVersion=0 for lists when more than one master is present is an option as well.

As Jordan noted, the optimization here is impactful because we avoid having to fetch many full pod lists from etcd when we only return a subset to each node. It's possible we could require all resourceVersion=0 calls to acquire a read lease on etcd, which bounds the delay but doesn't guarantee happens-before if the cache is delayed. If the watch returned a freshness guarantee, we could synchronize on that as well.

@smarterclayton (Contributor, Author)

We can do a synthetic write to the range and wait until it is observed by the cache, then service the rv=0. The logical equivalent is a serializable read on etcd for the range, but we need to know the highest observed RV on the range.

We can accomplish that by executing a range request with min create and mod revisions equal to the latest observed revision, and then performing the watch cache list at the highest RV on the range.
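In etcd clientv3 terms, a sketch of that range request under the assumptions above (latestObservedRev and the /registry/pods/ prefix are illustrative):

```go
import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// highestRevisionOnRange returns the newest mod revision seen on the pod range
// since latestObservedRev, without transferring full values. The watch cache
// could then wait until it has caught up to that revision before serving rv=0.
func highestRevisionOnRange(ctx context.Context, cli *clientv3.Client, latestObservedRev int64) (int64, error) {
	resp, err := cli.Get(ctx, "/registry/pods/",
		clientv3.WithPrefix(),
		clientv3.WithKeysOnly(),                   // keys only: cheap compared to a full list
		clientv3.WithMinModRev(latestObservedRev), // only keys touched at or after the observed revision
		clientv3.WithSerializable(),               // the serializable read over the range described above
	)
	if err != nil {
		return 0, err
	}
	highest := latestObservedRev
	for _, kv := range resp.Kvs {
		if kv.ModRevision > highest {
			highest = kv.ModRevision
		}
	}
	return highest, nil
}
```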

So:

  1. All reflectors need to preserve a “happens before” guarantee when fetching for general sanity - use of rv=0 breaks that
  2. rv=0 is a client visible optimization, but it’s not really necessary except for clients that can observe arbitrarily delayed history safely
  3. Not many clients can observe arbitrarily delayed history safely
  4. We should stop treating rv=0 specially

We can mitigate the performance impact by ensuring that an initial list from the watch cache preserves happens before. We can safely serve historical lists (rv=N) at any time. The watch cache already handles rv=N mostly correctly.

So the proposed change here is:

  1. Clients remove rv=0 and we stop honoring it
  2. Verify the watch cache correctly waits for rv=N queries
  3. We can serve a list or get from watch cache iff we perform a serializable read against etcd and retrieve the highest create/mod revisions
  4. We can make that query efficient by using min_create/mod_revision on the etcd list call
  5. Clients currently sending rv=0 where perf is critical (node) should start using the last observed resource version, which allows the watch cache to serve correctly.

That resolves this issue.

@kow3ns kow3ns added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 17, 2018
@smarterclayton (Contributor, Author)

Serializable read over the range, that is.

@smarterclayton (Contributor, Author)

Also:

  1. The next time someone adds a cache at any layer, we need to have a good review process that catches this. Probably a checklist we add into api review and an audit of any existing caches.

@jberkus commented Feb 21, 2018

If this MUST be resolved for 1.10, please add status/approved-for-milestone to it before Code Freeze. Thanks!

@jdumars (Member) commented Feb 23, 2018

Hi Clayton, could you go ahead and add "approved-for-milestone" label to this, as well as status (in progress)? That will help it stay in the milestone if this is a 1.10 blocker. Thanks!

@smarterclayton (Contributor, Author)

Caveat on the label - still trying to identify how we fix this - but no matter what, because this is a critical data integrity issue, we'll end up backporting this several releases. Will try to get a better estimate of the timeframe soon.

@kow3ns kow3ns added this to Backlog in Workloads Feb 26, 2018
@jberkus commented Feb 26, 2018

ooops, fixing labels.

@smarterclayton (Contributor, Author)

This can happen on a singleton master using an HA etcd cluster:

  1. apiserver is talking to member 0 in the cluster
  2. apiserver crashes and watch cache performs its initial list at RV=10
  3. kubelet deletes pod at RV=11
  4. a new pod is created at RV=12 on a different node
  5. kubelet crashes
  6. apiserver becomes partitioned from the etcd majority but is able to reestablish a watch from the partitioned member at RV=10
  7. kubelet contacts the apiserver and sees RV=10, launches the pod

@smarterclayton smarterclayton changed the title from "Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one api server" to "Kubernetes is vulnerable to stale reads for critical pod safety guarantees" Feb 28, 2018
@smarterclayton smarterclayton changed the title from "Kubernetes is vulnerable to stale reads for critical pod safety guarantees" to "Kubernetes is vulnerable to stale reads, violating critical pod safety guarantees" Feb 28, 2018
@wojtek-t (Member) commented Mar 1, 2018

So the proposed change here is:

  1. Clients remove rv=0 and we stop honoring it
  2. Verify the watch cache correctly waits for rv=N queries
  3. We can serve a list or get from watch cache iff we perform a serializable read against etcd and retrieve the highest create/mod revisions
  4. We can make that query efficient by using min_create/mod_revision on the etcd list call
  5. Clients currently sending rv=0 where perf is critical (node) should start using the last observed resource version, which allows the watch cache to serve correctly.

Why do we need (3) and (4)?
If we verify that rv=N works fine (and it should), why can't we just start setting rv=<last observed RV> instead?
If we do that, there are two possible outcomes:

  1. that RV is already in the watch cache, and then we can safely serve from the watch cache
  2. that RV is not yet in the watch cache (the watch cache is delayed), and then we will wait for some time; if it doesn't become fresh enough within some threshold we will reject with an error IIRC.

In the second case, the reflector may retry without setting any RV to get a consistent read from etcd (that should be pretty rare, so it should be fine from a performance point of view).
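A hedged sketch of that retry path with current client-go signatures (lastKnownRV and clientset come from the caller; the exact error a too-delayed cache returns may differ):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// relistAtLeastAsFreshAs lists pods no older than lastKnownRV. If the watch
// cache cannot become fresh enough in time, it falls back to a quorum read
// from etcd by leaving ResourceVersion unset.
func relistAtLeastAsFreshAs(ctx context.Context, clientset kubernetes.Interface, lastKnownRV string) (*corev1.PodList, error) {
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		ResourceVersion: lastKnownRV, // served once the cache has caught up to this RV
	})
	if err != nil && (apierrors.IsTimeout(err) || apierrors.IsResourceExpired(err)) {
		// Rare case: the cache stayed too far behind; pay for one consistent read.
		return clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	}
	return pods, err
}
```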

@smarterclayton What am I missing?

@wojtek-t (Member) commented Mar 1, 2018

@kubernetes/sig-scalability-misc

@jdumars (Member) commented Mar 1, 2018

Please stay in close contact with the release team on this.

@julienlau

Hi,

Maybe it would be a good idea to have a set of options to completely disable the watch cache?
As I understand it, today we can disable the watch cache in the API server with --watch-cache=false, but the local cache cannot be disabled in controllers.

These options would be helpful for testing purposes, but would also open the way more broadly to etcd shims such as Kine or FoundationDB.

@embano1 (Member) commented Oct 15, 2021

As I understand it, today we can disable the watch cache in the API server with --watch-cache=false, but the local cache cannot be disabled in controllers.

What do you mean by local cache? The cache.Store interface? What would be the benefit or use case of not using a local cache?

@julienlau

I am not very familiar with exactly how many caching layers there are between etcd and the final client, but the idea would be to deactivate all of them except the one in the final client (without converting Watch into Get semantics).
As I understand it, the final clients that currently observe stale reads are controllers, and the source of truth is etcd.
The API server is certainly one layer of cache, but I have the feeling that, because of shared caches / local caches, there can also be several layers of cache after the API server.

@embano1 (Member) commented Oct 19, 2021

@julienlau you might want to read the details shared in this project, explaining the time travel issues in detail and how to detect/prevent them: https://github.com/sieve-project/sieve

@smarterclayton (Contributor, Author) commented Nov 2, 2021

Just setting a status: this is still fundamentally broken in Kube, and the use of the watch cache is unsafe with HA apiservers in a way that allows two kubelets to run a pod with the same name at the same time, which violates assumptions that StatefulSets and PVCs depend on. Any cache that is HA must establish that it has the freshest data (data from time T > the time the request was made) to preserve the guarantees that Kube controllers implicitly rely on.

I have not seen a solution proposed other than requiring the watch cache to verify it is up to date before serving a LIST @ rv=0, but we do need to address this soon.

I'm open to arguments that the pod exclusion property is not fundamental but I haven't seen any credible ones in the last five years :)

@jberkus commented Nov 3, 2021

Welcome to the CAP theorem, I guess?

@julienlau commented Nov 4, 2021

Maybe defining specific test cases to work this out would help?
It seems to me that several other issues linked to this one use some sort of workaround to get things working, but they could use a proper way to handle this at the controller level.
Sieve seems to be designed to debug controllers, but the same failure scenarios could also be exercised at the apiserver level to test the apiserver watch cache and load-balancing behavior?
Maybe a read-through / write-through debug interface to the apiserver would also help (it may be what you are talking about when using LIST @ rv=0)?

@wojtek-t (Member)

I have wanted to post an update on this for quite some time and am finally getting to it (motivated by a need to link it).

With #83520 (so since 1.18), this problem has much lower exposure.

That particular PR ensures that when a reflector has to relist (e.g. because the watch fell out of history, or for whatever other reason), the relist will NOT go back in time: we either list from the cache by providing the currently known RV, which ensures that our LIST will be at least that fresh, or we list from etcd using quorum reads.

This means that, within a single incarnation of a component, it will never go back in time.

The remaining (and unsolved) case is what happens on component start. It still lists from the cache then (without any preconditions), which means it can go back in time compared to where it was before it was restarted.

But within a single incarnation of a component - we're now safe.

@julienlau

Update appreciated.
OK on RV usage in general, but regarding the corner case you mentioned, "component start":
in your view, do Kubernetes updates also meet these criteria?

@redbaron (Contributor) commented Jan 27, 2022

It still lists from the cache then (without any preconditions), which means it can go back in time compared to where it was before it was restarted.

A naive solution, which is likely not applicable, but I'll ask nevertheless: is listing from etcd with a quorum read an option here?

@wojtek-t (Member)

A naive solution, which is likely not applicable, but I'll ask nevertheless: is listing from etcd with a quorum read an option here?

This breaks scalability (especially badly for some resources, pods in particular), because you can't list "pods from my node" from etcd - you can only list all pods, deserialize them in kube-apiserver, and filter, for each node independently. If you have a 5k-node cluster with 150k pods that had an incident and all of the kubelets are now starting at the same time ...
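For context, the node-scoped list looks roughly like the sketch below; the spec.nodeName field selector is evaluated in kube-apiserver (or its watch cache), not in etcd, so serving it as a quorum read means every such list pulls the full pod keyspace out of etcd:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// listPodsOnNode is a sketch of the kubelet-style list: filtering by
// spec.nodeName happens in the apiserver, never in etcd.
func listPodsOnNode(ctx context.Context, clientset kubernetes.Interface, nodeName string) (*corev1.PodList, error) {
	return clientset.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("spec.nodeName", nodeName).String(),
	})
}
```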

@embano1 (Member) commented Jan 27, 2022

With https://github.com/sieve-project/sieve we found several bugs in real-world controllers, some of them caused by time-travel anomalies.

While not a perfect mitigation (because it always depends on the problem domain), using optimistic concurrency control patterns not only during updates but also during (conditional) deletes can mitigate some of the time-travel anomalies by fencing off operations from stale views. Again, not perfect, but a good practice that does not sacrifice scalability.
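As one concrete (hypothetical) example of that pattern in client-go, a precondition-guarded delete only succeeds if the object still has the UID and resourceVersion the controller last observed:

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// deleteIfUnchanged refuses to delete a pod that has been replaced or updated
// since the controller last saw it, fencing the operation off from stale views.
func deleteIfUnchanged(ctx context.Context, clientset kubernetes.Interface, ns, name string, uid types.UID, rv string) error {
	return clientset.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		Preconditions: &metav1.Preconditions{
			UID:             &uid, // fails with a conflict if a new pod now owns this name
			ResourceVersion: &rv,  // fails if the pod changed since we read it
		},
	})
}
```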

On smaller clusters, disabling the API server cache might be a safe alternative, too. This eliminates the aforementioned (re)start time travel anomaly in controllers.

Thoughts @wojtek-t ?

@wojtek-t (Member)

Sure - those are all good, but they are only partial mitigations, not a full solution. And what we're trying to do is provide a full solution.
FWIW - kubernetes/enhancements#3142 is one of the things we're looking at.
