
'failed to reserve container name' #4604

Closed
@sadortun

Description

Hi!

We are running containerd on GKE with pretty much all defaults: a dozen nodes and a few hundred pods, with plenty of free memory and disk.

Many of our pods started failing with a failed to reserve container name error in the last week or so. I do not recall any specific changes to the cluster or to the containers themselves.

Any help will be greatly appreciated!

Steps to reproduce the issue:
I have no clue how to specifically reproduce this issue.

The cluster has nothing special and the deployment is straightforward. The only thing that could be relevant is that our images are quite large, around 3 GB.

I got a few more details here : https://serverfault.com/questions/1036683/gke-context-deadline-exceeded-createcontainererror-and-failed-to-reserve-contai

Describe the results you received:

2020-10-07T08:01:45Z Successfully assigned default/apps-abcd-6b6cb5876b-nn9md to gke-bap-mtl-1-preemptible-e2-s4-e6a8ddb4-ng3v I 
2020-10-07T08:01:50Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:16:45Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:18:45Z Error: context deadline exceeded W 
2020-10-07T08:18:45Z Container image "redis:4.0-alpine" already present on machine I 
2020-10-07T08:18:53Z Created container redis I 
2020-10-07T08:18:53Z Started container redis I 
2020-10-07T08:18:53Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:02Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:02Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:19:03Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:20Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:20Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:19:21Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:34Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:34Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:19:35Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:44Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:19:44Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:19:54Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:20:08Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:20:08Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:20:18Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:20:30Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:20:30Z Error: failed to reserve container name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0": name "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0" is reserved for "8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f" W 
2020-10-07T08:21:19Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:26:35Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:31:36Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:36:26Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:41:18Z Pulling image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 
2020-10-07T08:46:41Z Successfully pulled image "gcr.io/my/appImage:223c133ff631c41e1bc21a8b7d7554036da4fb4e" I 

Describe the results you expected:
Live a happy life, error free :)

Output of containerd --version:

containerd github.com/containerd/containerd 1.3.2 ff48f57fc83a8c44cf4ad5d672424a98ba37ded6

Any other relevant information:

Activity

windniw commented on Nov 20, 2020

It looks like there is already a container named web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0 with id 8b21a9870e3ecc09bbb92da2036bd3c9b35f5829873d80cfbd14dc1e1827923f in containerd. When kubelet asks to create a new one with that same name, the CRI plugin fails.

Could you show the output of docker ps -a or ctr c list?

pfuhrmann commented on Dec 27, 2020

Did you manage to resolve this issue @sadortun? We are experiencing the same thing, also on GKE with the containerd runtime.

  Normal   Scheduled  2m21s               default-scheduler  Successfully assigned ***-77c979f8bf-px4v9 to gke-***-389a7e33-t1hl
  Warning  Failed     20s                 kubelet            Error: context deadline exceeded
  Normal   Pulled     7s (x3 over 2m20s)  kubelet            Container image "***" already present on machine
  Warning  Failed     7s (x2 over 19s)    kubelet            Error: failed to reserve container name ***-77c979f8bf-px4v9_***": name "***-77c979f8bf-px4v9_***" is reserved for "818fcfef09165d91ac8c86ed88714bb159a8358c3eca473ec07611a51d72b140"

We are deploying the same image to multiple deployments (30-40 pods) at the same time. We had no such issues with the Docker runtime.

Eventually, kubelet resolves the issue without manual intervention; however, it significantly slows the rollout of new images during a release (an extra 2-3 minutes to resolve the name conflicts).

sadortun (Author) commented on Dec 28, 2020

Hi @pfuhrmann

We investigated this quite deeply with the GKE dev team and were not able to reproduce it.

That said, we are pretty convinced the issue comes from one of the following two causes:

  • Disk IO is too high and containerd times out because of it
  • Starting 10-20+ pods at the same time on a single node causes a memory usage spike, and at some point the containerd process gets killed

Unfortunately, after a month of back and forth with the GKE devs, we were not able to find a solution.

The good news is that, for us, refactoring our application let us reduce the number of pods starting at once from about 20 down to 5. Since then, we have had no issues.

You might also want to increase the node boot drive size; it seems to help too.

kmarji commented on Apr 25, 2021

Any update on this? Did anybody manage to solve it? We are facing the same issue.

chrisroat commented on May 22, 2021

We are also seeing the same issue, GKE with containerd. It does seem to be correlated with starting many pods at once.

Switching from cos_containerd back to cos (docker based) seems to have resolved the situation, at least in the short term.

kmarji commented on May 22, 2021

We are also seeing the same issue, GKE with containerd. It does seem to be correlated with starting many pods at once.

Switching from cos_containerd back to cos (docker based) seems to have resolved the situation, at least in the short term.

Same for us: once we switched back to cos with Docker, everything worked.

sadortun (Author) commented on May 22, 2021

Same for us: once we switched back to cos with Docker, everything worked.

In the end we still had occasional issues, and we also had to switch back to cos.

mikebrow (Member) commented on May 22, 2021

Jotting down some notes here; apologies if it's lengthy.

Let me try to explain/figure out the reason you got "failed to reserve container name".

Kubelet tried to create a container that it had already asked containerd to create at least once. When containerd tried the first time, it received a field in the container create metadata named attempt, holding the default value 0, and it reserved the unique name for attempt 0 that you see in your log (note the _0 at the end of the name): "web_apps-abcd-6b6cb5876b-nn9md_default_3dc00fd6-0c5d-42be-bec8-e4f6cad616da_0". Then something caused a context timeout between kubelet and containerd. The kubelet context timeout is configurable ("--runtime-request-timeout duration Default: 2m0s"), and a 2 min timeout could happen for any number of reasons: an unusually long garbage collection, a file system hiccup, locked files, deadlocks while waiting, some very expensive init operation on the node for one of your other containers.. who knows? That's why we have/need recovery procedures.
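As a rough illustration (this is not containerd's exact code; the format is inferred from the names in the log above), the reserved name is just the CRI metadata joined with underscores, with the attempt counter at the end:

package main

import "fmt"

// makeContainerName sketches how the CRI plugin derives the name it reserves:
// containerName_podName_namespace_podUID_attempt. Illustrative only, inferred
// from the log format above, not copied from the containerd source.
func makeContainerName(containerName, podName, namespace, podUID string, attempt uint32) string {
    return fmt.Sprintf("%s_%s_%s_%s_%d", containerName, podName, namespace, podUID, attempt)
}

func main() {
    // Reproduces the name seen in the log above for attempt 0.
    fmt.Println(makeContainerName(
        "web",
        "apps-abcd-6b6cb5876b-nn9md",
        "default",
        "3dc00fd6-0c5d-42be-bec8-e4f6cad616da",
        0,
    ))
}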

What should have happened is that kubelet should've incremented the attempt number (or at least that's how I see it from this side, the containerd side, of the CRI API). But kubelet did not increment the attempt number, and furthermore containerd was still trying to create the container from the first request. The create on the containerd side may even have finished at this point; it is possible the timeout only happened on the kubelet side and containerd kept going, finished the create, and possibly even attempted to return the success result. If containerd had actually failed, it would have deleted the reservation for that container name, because the immediate thing we do after reserving the name in containerd is to defer its release on any error in the create: https://github.com/containerd/containerd/blob/master/pkg/cri/server/container_create.go#L65-L84
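The reserve-then-defer-release shape of that code looks roughly like the sketch below. This is a simplified stand-in, not the real implementation; the NameStore type and method names are invented for illustration:

package main

import (
    "fmt"
    "sync"
)

// NameStore is a simplified stand-in for containerd's name registrar:
// a name can be held by exactly one container ID at a time.
type NameStore struct {
    mu    sync.Mutex
    names map[string]string // name -> container ID holding it
}

// Reserve fails if the name is already held.
func (s *NameStore) Reserve(name, id string) error {
    s.mu.Lock()
    defer s.mu.Unlock()
    if holder, ok := s.names[name]; ok {
        return fmt.Errorf("name %q is reserved for %q", name, holder)
    }
    s.names[name] = id
    return nil
}

// ReleaseByName frees a reservation (used when the create fails).
func (s *NameStore) ReleaseByName(name string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    delete(s.names, name)
}

// createContainer mirrors the shape of the CRI create path: reserve the name
// first, and defer its release so the reservation is kept only when the rest
// of the create succeeds.
func createContainer(store *NameStore, name, id string, doCreate func() error) (retErr error) {
    if err := store.Reserve(name, id); err != nil {
        return fmt.Errorf("failed to reserve container name %q: %w", name, err)
    }
    defer func() {
        if retErr != nil {
            store.ReleaseByName(name) // released only on error
        }
    }()
    return doCreate()
}

func main() {
    store := &NameStore{names: map[string]string{}}
    // The first create succeeded (or is still in flight), so the name stays reserved...
    _ = createContainer(store, "web_pod_default_uid_0", "8b21a987", func() error { return nil })
    // ...and a retry with the same attempt number (hence the same name) fails to reserve it.
    fmt.Println(createContainer(store, "web_pod_default_uid_0", "new-id", func() error { return nil }))
}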

So, OK, skimming over the kubelet code, I believe this is the code that decides which attempt number we are on: https://github.com/kubernetes/kubernetes/blame/master/pkg/kubelet/kuberuntime/kuberuntime_container.go#L173-L292

In my skim, I think I see a window where kubelet will try attempt 0 a second time after the first create attempt fails with a context timeout. But I may be reading the code wrong? @dims @feiskyer @Random-Liu

CyberHippo commented on Jul 20, 2021

Bumped into this issue as well. Switching back to cos with docker.

jsoref commented on Aug 26, 2021

Fwiw, we're hitting this this week.

k8s 1.20.8-gke.900; containerd://1.4.3

Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.8-gke.900", GitCommit:"28ab8501be88ea42e897ca8514d7cd0b436253d9", GitTreeState:"clean", BuildDate:"2021-06-30T09:23:36Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}

kubectl get nodes -o json | jq '.items[].status.nodeInfo.containerRuntimeVersion' |uniq
"containerd://1.4.3"

In my case, the pod is owned by a (batch/v1) Job, and the Job by a (batch/v1beta1) CronJob.

The "reserved for" id only appears in the error; nothing else seems to know about it.

Using Google cloud logging, I can search:

"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" OR "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32"

w/ a search range of 2021-08-22 01:58:00.000 AM EDT..2021-08-22 02:03:00.000 AM EDT

This is the first hit:

⚠️ Warning 2021-08-22 02:02:44.000 EDT
backup-test-db-1629612000-cz8ks
"Error: failed to reserve container name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0": name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" is reserved for "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32""

And this is the second hit:

🌟 Default
2021-08-22 02:02:45.217 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:44.792364 1653 remote_runtime.go:227] CreateContainer in sandbox "c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128" from runtime service failed: rpc error: code = Unknown desc = failed to reserve container name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0": name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" is reserved for "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32"

There are additional hits, but they aren't exciting.

For reference, this search (with the same time params) yields nothing:

("backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" OR "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32")
-"Attempt:0"
"Attempt"

This search

("backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" OR "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32")
"Attempt:0"

yields two entries:

Default
2021-08-22 02:02:45.219 EDT
gke-default-cluster-default-pool-c90133be-6xkd
time="2021-08-22T06:02:44.792102443Z" level=error msg="CreateContainer within sandbox \"c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128\" for &ContainerMetadata{Name:backup-db,Attempt:0,} failed" error="failed to reserve container name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\": name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\" is reserved for \"188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32\""
Default
2021-08-22 02:02:56.899 EDT
gke-default-cluster-default-pool-c90133be-6xkd
time="2021-08-22T06:02:56.899853062Z" level=error msg="CreateContainer within sandbox \"c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128\" for &ContainerMetadata{Name:backup-db,Attempt:0,} failed" error="failed to reserve container name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\": name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\" is reserved for \"188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32\""

(There are additional hits if I extend the time window forward, but as they appear to be identical other than the timestamp, I don't see any value in repeating them.)

Relevant log events

The best query I've found is:

resource.labels.cluster_name="default-cluster"
"backup-test-db-1629612000-cz8ks_test"

(The former is to limit which part of GCloud to search, and the latter is the search.)

Default
2021-08-22 02:00:03.259 EDT
gke-default-cluster-default-pool-c90133be-6xkd
I0822 06:00:03.259074 1653 kubelet.go:1916] SyncLoop (ADD, "api"): "backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)"
Default
2021-08-22 02:00:03.569 EDT
gke-default-cluster-default-pool-c90133be-6xkd
I0822 06:00:03.569830 1653 kuberuntime_manager.go:445] No sandbox for pod "backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)" can be found. Need to start a new one
Default
2021-08-22 02:00:43.213 EDT
gke-default-cluster-default-pool-c90133be-6xkd
I0822 06:00:43.213133    1653 kubelet.go:1954] SyncLoop (PLEG): "backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)", event: &pleg.PodLifecycleEvent{ID:"efe343a0-5641-427c-8a65-1b7dc939432d", Type:"ContainerStarted", Data:"c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128"}
2021-08-22 02:02:45.217 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:44.792364 1653 remote_runtime.go:227] CreateContainer in sandbox "c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128" from runtime service failed: rpc error: code = Unknown desc = failed to reserve container name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0": name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" is reserved for "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32"
Default
2021-08-22 02:02:45.217 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:42.840645 1653 pod_workers.go:191] Error syncing pod efe343a0-5641-427c-8a65-1b7dc939432d ("backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)"), skipping: failed to "StartContainer" for "backup-db" with CreateContainerError: "context deadline exceeded"
Default
2021-08-22 02:02:45.217 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:44.792589 1653 pod_workers.go:191] Error syncing pod efe343a0-5641-427c-8a65-1b7dc939432d ("backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)"), skipping: failed to "StartContainer" for "backup-db" with CreateContainerError: "failed to reserve container name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\": name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\" is reserved for \"188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32\""
Default
2021-08-22 02:02:56.900 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:56.900057 1653 remote_runtime.go:227] CreateContainer in sandbox "c9f8cf0e4fc280b632bf8f4365dccf34f213c5aa4636a4424aab68940d579128" from runtime service failed: rpc error: code = Unknown desc = failed to reserve container name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0": name "backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0" is reserved for "188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32"
Default
2021-08-22 02:02:56.900 EDT
gke-default-cluster-default-pool-c90133be-6xkd
E0822 06:02:56.900309 1653 pod_workers.go:191] Error syncing pod efe343a0-5641-427c-8a65-1b7dc939432d ("backup-test-db-1629612000-cz8ks_test(efe343a0-5641-427c-8a65-1b7dc939432d)"), skipping: failed to "StartContainer" for "backup-db" with CreateContainerError: "failed to reserve container name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\": name \"backup-db_backup-test-db-1629612000-cz8ks_test_efe343a0-5641-427c-8a65-1b7dc939432d_0\" is reserved for \"188e7647efe4e1243a4fb3529c69f95c83e3876d4989ba94a409c652f99a8f32\""
matti commented on Sep 17, 2021

same, switching back to docker

(38 hidden items not shown)

sadortun (Author) commented on Jan 25, 2022

@fuweid

Thanks for your time on this issue.

Unfortunately, I stopped using COS back in 2020 after we could not find a solution.

I'm 97% sure we were using overlayfs; as for the rest, I have no way to recover that historical data.

Sorry about that.

added a commit that references this issue on Jan 26, 2022
a44a720
added 2 commits that reference this issue on Jan 28, 2022
e1aa429
813a061
fuweid (Member) commented on Jan 28, 2022

@sadortun I filed a PR to enhance this: #6478 (comment)

Not sure what's different between Docker and containerd in GKE, sorry about that.

derekperkins commented on Jan 31, 2022

We're on GKE 1.21.6-gke1500 and we've been seeing this problem for the last 1-2 months.

qiutongs (Contributor) commented on Feb 1, 2022

@sadortun I filed a PR to enhance this: #6478 (comment)

I got some good results showing this patch improves the latency of "CreateContainer".

  • Setup: GKE 1.20 ubuntu_containerd node with a 10 GB boot disk
  • Execution A
    • Add some disk IO: stress-ng --io 1 -d 1 --timeout 7200 --hdd-bytes 8M
    • Create a deployment of nginx with 25 replicas
    • Check the containerd log
  • Execution B: repeat the above steps with a newly built containerd that includes this patch
  • Result A
    • Saw "failed to reserve container name"
    • The last pod was ready after 15 min
  • Result B
    • Did not see "failed to reserve container name"
    • All CreateContainer calls completed within 2 min
    • The last pod was ready after 10 min

Please note this is based on a couple of experiments, not an ample data set. stress-ng doesn't produce stable IOPS, so the disk state cannot be exactly the same across the two cases.

qiutongs

qiutongs commented on Feb 1, 2022

@qiutongs
Contributor

Summary (2022/02)

"failed to reserve container name" error is returned by containerd CRI if there is an in-flight CreateContainer request reserving the same container name (like below).
T1: 1st CreateContainer(XYZ) request is sent. (Timeout on Kubelet side)
T2: 2nd CreateContainer(XYZ) request is sent (Kubelet retry)
T3: 2nd CreateContainer request returns "failed to reserve container name XYZ" error
T4: 1st CreateContainer request is still in-flight…
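A tiny self-contained sketch of that timeline (the in-memory map, names, and durations are invented purely for illustration; the real kubelet and containerd talk over CRI gRPC):

package main

import (
    "fmt"
    "sync"
    "time"
)

// server stands in for the containerd CRI plugin in this simulation.
type server struct {
    mu       sync.Mutex
    reserved map[string]bool
}

func (s *server) CreateContainer(name string, work time.Duration) error {
    s.mu.Lock()
    if s.reserved[name] {
        s.mu.Unlock()
        return fmt.Errorf("failed to reserve container name %q", name)
    }
    s.reserved[name] = true
    s.mu.Unlock()

    time.Sleep(work) // e.g. stuck behind a slow sync-fs on a throttled disk
    // On success the reservation is kept; it is only released if the create fails.
    return nil
}

func main() {
    s := &server{reserved: map[string]bool{}}
    name := "web_pod_default_uid_0"

    // T1: the 1st CreateContainer is sent; kubelet gives up after its
    // runtime-request-timeout, but the server-side work keeps running.
    go s.CreateContainer(name, 3*time.Second)

    time.Sleep(time.Second)
    // T2/T3: kubelet retries with the same attempt number, hence the same
    // name, and the retry fails while the 1st request still holds it (T4).
    fmt.Println(s.CreateContainer(name, 0))
}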

Don't panic. Given sufficient time, the container and pod will be created successfully, as long as you are using restartPolicy:Always or restartPolicy:OnFailure in PodSpec.

Root Cause and Fix

Slow disk operations (e.g. disk throttling on GKE) are the culprit. Heavy disk IO can come from a number of sources: the user's disk-heavy workloads, pulling big images, and the containerd CRI implementation itself.

An unnecessary sync-fs operation was found in the CreateContainer path; that is where CreateContainer gets stuck. The sync-fs is removed in #6478. Not only does this make CreateContainer return faster, it also reduces the disk IO generated by containerd.
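To get a feel for why a stray sync-fs hurts on a throttled boot disk, you can time one on a node yourself. A minimal sketch, assuming Linux and golang.org/x/sys/unix; the path below is only an example:

package main

import (
    "fmt"
    "log"
    "os"
    "time"

    "golang.org/x/sys/unix"
)

func main() {
    // Example path only: any file or directory on the filesystem you want
    // to flush, e.g. containerd's state directory on the node's boot disk.
    f, err := os.Open("/var/lib/containerd")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    start := time.Now()
    // syncfs(2) flushes the whole filesystem containing the fd. On a
    // throttled boot disk with many dirty pages this can block for a long
    // time, which is the kind of stall described above.
    if err := unix.Syncfs(int(f.Fd())); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("syncfs took %s\n", time.Since(start))
}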

Please note there may be other, as yet undiscovered, reasons contributing to this problem.

Mitigation

  1. If pods have failed, consider using restartPolicy: Always or restartPolicy: OnFailure in the PodSpec
  2. Increase the boot disk IOPS (e.g. upgrade the disk type or increase the disk size)
  3. Upgrade containerd with the patch "oci: use readonly mount to read user/group info" #6478, which will be available in 1.6+ and 1.5.x (backport in progress)
locked as resolved and limited conversation to collaborators on Feb 1, 2022
added a commit that references this issue on Feb 3, 2022
61be716
added a commit that references this issue on Feb 8, 2022
a3da590
added a commit that references this issue on Apr 21, 2022
dec977b
added a commit that references this issue on Oct 23, 2024
f022f7c