
Pods stuck on terminating #51835

Closed

@igorleao

Description

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
Pods stuck on terminating for a long time

What you expected to happen:
Pods get terminated

How to reproduce it (as minimally and precisely as possible):

  1. Run a deployment
  2. Delete it
  3. Pods are still terminating (see the command sketch below)
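A minimal sketch of the steps above, assuming kubectl 1.7 where kubectl run creates a Deployment (name and image are placeholders):

# 1. run a throwaway deployment
kubectl run my-app --image=alpine:3.4 --replicas=3 -- sleep 3600
# 2. delete it
kubectl delete deployment my-app
# 3. pods linger in Terminating well past the 30s grace period
kubectl get pods -l run=my-app --watch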

Anything else we need to know?:
Kubernetes pods stuck as Terminating for a few hours after getting deleted.

Logs:
kubectl describe pod my-pod-3854038851-r1hc3

Name:				my-pod-3854038851-r1hc3
Namespace:			container-4-production
Node:				ip-172-16-30-204.ec2.internal/172.16.30.204
Start Time:			Fri, 01 Sep 2017 11:58:24 -0300
Labels:				pod-template-hash=3854038851
				release=stable
				run=my-pod-3
Annotations:			kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"container-4-production","name":"my-pod-3-3854038851","uid":"5816c...
				prometheus.io/scrape=true
Status:				Terminating (expires Fri, 01 Sep 2017 14:17:53 -0300)
Termination Grace Period:	30s
IP:
Created By:			ReplicaSet/my-pod-3-3854038851
Controlled By:			ReplicaSet/my-pod-3-3854038851
Init Containers:
  ensure-network:
    Container ID:	docker://guid-1
    Image:		XXXXX
    Image ID:		docker-pullable://repo/ensure-network@sha256:guid-0
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		True
    Restart Count:	0
    Environment:	<none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Containers:
  container-1:
    Container ID:	docker://container-id-guid-1
    Image:		XXXXX
    Image ID:		docker-pullable://repo/container-1@sha256:guid-2
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	100m
      memory:	1G
    Requests:
      cpu:	100m
      memory:	1G
    Environment:
      XXXX
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-2:
    Container ID:	docker://container-id-guid-2
    Image:		alpine:3.4
    Image ID:		docker-pullable://alpine@sha256:alpine-container-id-1
    Port:		<none>
    Command:
      X
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	20m
      memory:	40M
    Requests:
      cpu:		10m
      memory:		20M
    Environment:	<none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-3:
    Container ID:	docker://container-id-guid-3
    Image:		XXXXX
    Image ID:		docker-pullable://repo/container-3@sha256:guid-3
    Port:		<none>
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	100m
      memory:	200M
    Requests:
      cpu:	100m
      memory:	100M
    Readiness:	exec [nc -zv localhost 80] delay=1s timeout=1s period=5s #success=1 #failure=3
    Environment:
      XXXX
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
  container-4:
    Container ID:	docker://container-id-guid-4
    Image:		XXXX
    Image ID:		docker-pullable://repo/container-4@sha256:guid-4
    Port:		9102/TCP
    State:		Terminated
      Exit Code:	0
      Started:		Mon, 01 Jan 0001 00:00:00 +0000
      Finished:		Mon, 01 Jan 0001 00:00:00 +0000
    Ready:		False
    Restart Count:	0
    Limits:
      cpu:	600m
      memory:	1500M
    Requests:
      cpu:	600m
      memory:	1500M
    Readiness:	http-get http://:8080/healthy delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      XXXX
    Mounts:
      /app/config/external from volume-2 (ro)
      /data/volume-1 from volume-1 (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-xxxxx (ro)
Conditions:
  Type		Status
  Initialized 	True
  Ready 	False
  PodScheduled 	True
Volumes:
  volume-1:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	volume-1
    Optional:	false
  volume-2:
    Type:	ConfigMap (a volume populated by a ConfigMap)
    Name:	external
    Optional:	false
  default-token-xxxxx:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-xxxxx
    Optional:	false
QoS Class:	Burstable
Node-Selectors:	<none>

sudo journalctl -u kubelet | grep "my-pod"

[...]
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Releasing address using workloadID" Workload=my-pod-3854038851-r1hc3
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Releasing all IPs with handle 'my-pod-3854038851-r1hc3'"
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=warning msg="Asked to release address but it doesn't exist. Ignoring" Workload=my-pod-3854038851-r1hc3 workloadId=my-pod-3854038851-r1hc3
Sep 01 17:17:56 ip-172-16-30-204 kubelet[9619]: time="2017-09-01T17:17:56Z" level=info msg="Teardown processing complete." Workload=my-pod-3854038851-r1hc3 endpoint=<nil>
Sep 01 17:19:06 ip-172-16-30-204 kubelet[9619]: I0901 17:19:06.591946    9619 kubelet.go:1824] SyncLoop (DELETE, "api"):my-pod-3854038851(b8cf2ecd-8f25-11e7-ba86-0a27a44c875)"

sudo journalctl -u docker | grep "docker-id-for-my-pod"

Sep 01 17:17:55 ip-172-16-30-204 dockerd[9385]: time="2017-09-01T17:17:55.695834447Z" level=error msg="Handler for POST /v1.24/containers/docker-id-for-my-pod/stop returned error: Container docker-id-for-my-pod is already stopped"
Sep 01 17:17:56 ip-172-16-30-204 dockerd[9385]: time="2017-09-01T17:17:56.698913805Z" level=error msg="Handler for POST /v1.24/containers/docker-id-for-my-pod/stop returned error: Container docker-id-for-my-pod is already stopped"
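The "already stopped" errors suggest the containers themselves are gone and the hang is in post-stop cleanup (volumes/network). A quick sanity check on the node, with the container ID as a placeholder:

# confirm the state docker has on record for the container
docker ps -a | grep docker-id-for-my-pod
docker inspect -f '{{.State.Status}} {{.State.FinishedAt}}' docker-id-for-my-pod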

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.3", GitCommit:"2c2fe6e8278a5db2d15a013987b53968c743f2a1", GitTreeState:"clean", BuildDate:"2017-08-03T15:13:53Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.6", GitCommit:"7fa1c1756d8bc963f1a389f4a6937dc71f08ada2", GitTreeState:"clean", BuildDate:"2017-06-16T18:21:54Z", GoVersion:"go1.7.6", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:
    AWS

  • OS (e.g. from /etc/os-release):
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Kernel (e.g. uname -a):
    Linux ip-172-16-30-204 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:
    Kops

  • Others:
    Docker version 1.12.6, build 78d1802

@kubernetes/sig-aws @kubernetes/sig-scheduling

Activity

needs-sig label added on Sep 1, 2017
igorleao (Author) commented on Sep 1, 2017

@kubernetes/sig-aws @kubernetes/sig-scheduling

resouer (Contributor) commented on Sep 3, 2017

Usually volume and network cleanup consume the most time during termination. Can you find out in which phase your pod is stuck? Volume cleanup, for example?

dixudx (Member) commented on Sep 3, 2017

Usually volume and network cleanup consume the most time during termination.

Correct. They are always suspect.

@igorleao You can try kubectl delete pod xxx --now as well.
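For reference, --now is shorthand for --grace-period=1; if even that hangs, a force delete removes the API object without waiting for kubelet confirmation (pod name is a placeholder). Note that force-deleting only clears the API object; anything left on the node still needs cleanup:

kubectl delete pod my-pod-3854038851-r1hc3 --now
# last resort: delete immediately, skipping kubelet confirmation
kubectl delete pod my-pod-3854038851-r1hc3 --grace-period=0 --force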

igorleao (Author) commented on Sep 4, 2017

Hi @resouer and @dixudx
I'm not sure. Looking at kubelet logs for a different pod with the same problem, I found:

Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=info msg="Releasing address using workloadID" Workload=my-pod-969733955-rbxhn
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=info msg="Releasing all IPs with handle 'my-pod-969733955-rbxhn'"
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=warning msg="Asked to release address but it doesn't exist. Ignoring" Workload=my-pod-969733955-rbxhn workloadId=my-pod-969733955-rbxhn
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: time="2017-09-02T15:31:57Z" level=info msg="Teardown processing complete." Workload=my-pod-969733955-rbxhn endpoint=<nil>
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.496132    9620 qos_container_manager_linux.go:285] [ContainerManager]: Updated QoS cgroup configuration
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.968147    9620 reconciler.go:201] UnmountVolume operation started for volume "kubernetes.io/secret/GUID-default-token-wrlv3" (spec.Name: "default-token-wrlv3") from pod "GUID" (UID: "GUID").
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.968245    9620 reconciler.go:201] UnmountVolume operation started for volume "kubernetes.io/secret/GUID-token-key" (spec.Name: "token-key") from pod "GUID" (UID: "GUID").
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968537    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-token-key\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968508761 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-token-key" (volume.spec.Name: "token-key") pod "GUID" (UID: "GUID") with: rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_token-key.deleting~818780979: device or resource busy
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968744    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-default-token-wrlv3\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968719924 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-default-token-wrlv3" (volume.spec.Name: "default-token-wrlv3") pod "GUID" (UID: "GUID") with: rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/default-token-wrlv3 /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_default-token-wrlv3.deleting~940140790: device or resource busy
--
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778742    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_default-token-wrlv3.deleting~940140790" (spec.Name: "wrapped_default-token-wrlv3.deleting~940140790") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778753    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~850807831" (spec.Name: "wrapped_token-key.deleting~850807831") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778764    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~413655961" (spec.Name: "wrapped_token-key.deleting~413655961") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778774    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~818780979" (spec.Name: "wrapped_token-key.deleting~818780979") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778784    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~348212189" (spec.Name: "wrapped_token-key.deleting~348212189") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778796    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~848395852" (spec.Name: "wrapped_token-key.deleting~848395852") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778808    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_default-token-wrlv3.deleting~610264100" (spec.Name: "wrapped_default-token-wrlv3.deleting~610264100") devicePath: ""
Sep 02 15:33:04 ip-172-16-30-208 kubelet[9620]: I0902 15:33:04.778820    9620 reconciler.go:363] Detached volume "kubernetes.io/secret/GUID-wrapped_token-key.deleting~960022821" (spec.Name: "wrapped_token-key.deleting~960022821") devicePath: ""
Sep 02 15:33:05 ip-172-16-30-208 kubelet[9620]: I0902 15:33:05.081380    9620 server.go:778] GET /stats/summary/: (37.027756ms) 200 [[Go-http-client/1.1] 10.0.46.202:54644]
Sep 02 15:33:05 ip-172-16-30-208 kubelet[9620]: I0902 15:33:05.185367    9620 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/GUID-calico-token-w8tzx" (spec.Name: "calico-token-w8tzx") pod "GUID" (UID: "GUID").
Sep 02 15:33:07 ip-172-16-30-208 kubelet[9620]: I0902 15:33:07.187953    9620 kubelet.go:1824] SyncLoop (DELETE, "api"): "my-pod-969733955-rbxhn_container-4-production(GUID)"
Sep 02 15:33:13 ip-172-16-30-208 kubelet[9620]: I0902 15:33:13.879940    9620 aws.go:937] Could not determine public DNS from AWS metadata.
Sep 02 15:33:20 ip-172-16-30-208 kubelet[9620]: I0902 15:33:20.736601    9620 server.go:778] GET /metrics: (53.063679ms) 200 [[Prometheus/1.7.1] 10.0.46.198:43576]
Sep 02 15:33:23 ip-172-16-30-208 kubelet[9620]: I0902 15:33:23.898078    9620 aws.go:937] Could not determine public DNS from AWS metadata.

As you can see, this cluster uses Calico for CNI.
The following lines caught my attention:

Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: I0902 15:31:57.968245    9620 reconciler.go:201] UnmountVolume operation started for volume "kubernetes.io/secret/GUID-token-key" (spec.Name: "token-key") from pod "GUID" (UID: "GUID").
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968537    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-token-key\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968508761 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-token-key" (volume.spec.Name: "token-key") pod "GUID" (UID: "GUID") with: rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_token-key.deleting~818780979: device or resource busy
Sep 02 15:31:57 ip-172-16-30-208 kubelet[9620]: E0902 15:31:57.968744    9620 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/secret/GUID-default-token-wrlv3\" (\"GUID\")" failed. No retries permitted until 2017-09-02 15:31:59.968719924 +0000 UTC (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "kubernetes.io/secret/GUID-default-token-wrlv3" (volume.spec.Name: "default-token-wrlv3") pod "GUID" (UID: "GUID") with: rename 

Is there a better way to find out in which phase a pod is stuck?

kubectl delete pod xxx --now seems to work pretty well, but I'd really like to find the root cause and avoid manual intervention.
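One rough way to narrow down the stuck phase (pod name and UID are placeholders): check whether the API object is held by finalizers, then whether volume mounts survive on the node.

# does the pod still carry finalizers, or just a deletionTimestamp?
kubectl get pod my-pod-969733955-rbxhn -o jsonpath='{.metadata.deletionTimestamp} {.metadata.finalizers}'
# on the node: any volume mounts left for the pod UID?
mount | grep <POD_UID>
# is kubelet still retrying teardown for it?
sudo journalctl -u kubelet | grep <POD_UID> | tail -20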

dixudx (Member) commented on Sep 4, 2017

rename /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/wrapped_token-key.deleting~818780979: device or resource busy

Seems kubelet failed to tear down the secret volume because of this file renaming ("device or resource busy").

@igorleao Is this reproducible, or does it only happen occasionally? I've run into such errors before, so I just want to make sure.
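To see what is keeping the directory busy, something like this on the node may help (path copied from the log above with the pod GUID left as a placeholder; assumes fuser/lsof are installed):

# which processes hold the mounted secret directory?
fuser -vm /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key
# alternative per-file view
lsof +D /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key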

igorleao (Author) commented on Sep 4, 2017

@dixudx it happens several times a day on a certain cluster. Other clusters created with the same version of kops and Kubernetes, in the same week, work just fine.

jingxu97 (Contributor) commented on Sep 13, 2017

@igorleao As the log shows, the volume manager failed to remove the secret directory because the device is busy.
Could you please check whether the directory /var/lib/kubelet/pods/GUID/volumes/kubernetes.io~secret/token-key is still mounted or not? Thanks!
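A quick way to check that, with the pod GUID as a placeholder (findmnt's PROPAGATION column also shows the shared/rslave flags that matter later in this thread):

mount | grep GUID
findmnt -o TARGET,PROPAGATION | grep GUID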

r7vme commented on Sep 26, 2017

@igorleao How do you run kubelet? In a container? If so, can you please post your systemd unit or Docker config for kubelet?

We see similar behaviour. We run kubelet as a container, and the problem was partially mitigated by mounting /var/lib/kubelet as shared (by default Docker mounts volumes as rslave). But we still see similar issues, just less frequently. Currently I suspect that some other mounts should be done differently (e.g. /var/lib/docker or /rootfs).
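For context, the mitigation described above amounts to starting the kubelet container with shared mount propagation on /var/lib/kubelet. A sketch only; the image name, binary path, and remaining flags are illustrative, not a complete unit:

docker run -d --name kubelet \
  --privileged --net=host --pid=host \
  -v /var/lib/kubelet:/var/lib/kubelet:shared \
  -v /var/lib/docker:/var/lib/docker \
  -v /:/rootfs:ro \
  my-kubelet-image \
  /usr/local/bin/kubelet --kubeconfig=/etc/kubernetes/kubeconfig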

r7vme commented on Sep 28, 2017

@stormltf Can you please post your kubelet container configuration?

r7vme commented on Sep 29, 2017

@stormltf you're running kubelet in a container and don't use the --containerized flag (which does some tricks with mounts). This basically means that all mounts kubelet makes are done in the container's mount namespace. The good thing is that they will be propagated back to the host machine's namespace (as you have /var/lib/kubelet shared), but I'm not sure what happens when that namespace is removed (when the kubelet container is removed).

For the stuck pods, can you please do the following on the node where the pod is running:

  • docker exec -ti /kubelet /bin/bash -c "mount | grep STUCK_POD_UUID"
  • and the same on the node itself: mount | grep STUCK_POD_UUID.

Please also do the same for a freshly created pod. I expect to see some /var/lib/kubelet mounts (e.g. default-secret).

r7vme commented on Oct 11, 2017

@stormltf did you restart kubelet after the first two pods were created?

r7vme commented on Oct 12, 2017

@stormltf You can try making /var/lib/docker and /rootfs shared mountpoints (which I don't see in your docker inspect, but do see inside the container).
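On the host, making a path a shared mountpoint would look something like this (bind-mount it to itself first if it is not already a mountpoint; /rootfs is usually the host / bind-mounted into the kubelet container, so the host-side equivalent there is sharing /):

sudo mount --bind /var/lib/docker /var/lib/docker
sudo mount --make-shared /var/lib/docker
sudo mount --make-shared /
# verify the propagation flags
findmnt -o TARGET,PROPAGATION /var/lib/docker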

(213 remaining items omitted)


Labels: kind/bug, sig/node, sig/storage
