
CoreDNS not started with k8s 1.11 and weave (CentOS 7) #998

Closed
carlosrmendes opened this issue Jul 17, 2018 · 33 comments · Fixed by kubernetes/website#9872

@carlosrmendes

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version 1.11

Environment:

  • Kubernetes version (use kubectl version): 1.11
  • Cloud provider or hardware configuration: aws ec2 with (16vcpus 64gb RAM)
  • OS (e.g. from /etc/os-release): centos 7
  • Kernel (e.g. uname -a): 3.10.0-693.17.1.el7.x86_64
  • Others: weave as cni add-on

What happened?

After kubeadm init, the coredns pods stay in Error:

NAME                                   READY     STATUS    RESTARTS   AGE
coredns-78fcdf6894-ljdjp               0/1       Error     6          9m
coredns-78fcdf6894-p6flm               0/1       Error     6          9m
etcd-master                            1/1       Running   0          8m
heapster-5bbdfbff9f-h5h2n              1/1       Running   0          9m
kube-apiserver-master                  1/1       Running   0          8m
kube-controller-manager-master         1/1       Running   0          8m
kube-proxy-5642r                       1/1       Running   0          9m
kube-scheduler-master                  1/1       Running   0          8m
kubernetes-dashboard-6948bdb78-bwkvx   1/1       Running   0          9m
weave-net-r5jkg                        2/2       Running   0          9m

The logs of both pods show the following:
standard_init_linux.go:178: exec user process caused "operation not permitted"

@neolit123
Member

@kubernetes/sig-network-bugs

@carlosmkb, what is your docker version?

@timothysc
Member

timothysc commented Jul 17, 2018

I find this hard to believe; we test CentOS 7 pretty extensively on our side.

Do you have the system and pod logs?

@carlosrmendes
Author

@dims, that could make sense, I will try.

@neolit123 and @timothysc

docker version: docker-1.13.1-63.git94f4240.el7.centos.x86_64

coredns pods log: standard_init_linux.go:178: exec user process caused "operation not permitted"
system log journalctl -xeu kubelet:

Jul 17 23:45:17 server.raid.local kubelet[20442]: E0717 23:45:17.679867   20442 pod_workers.go:186] Error syncing pod dd030886-89f4-11e8-9786-0a92797fa29e ("cas-7d6d97c7bd-mzw5j_raidcloud(dd030886-89f4-11e8-9786-0a92797fa29e)"), skipping: failed to "StartContainer" for "cas" with ImagePullBackOff: "Back-off pulling image \"registry.raidcloud.io/raidcloud/cas:180328.pvt.01\""
Jul 17 23:45:18 server.raid.local kubelet[20442]: I0717 23:45:18.679059   20442 kuberuntime_manager.go:513] Container {Name:json2ldap Image:registry.raidcloud.io/raidcloud/json2ldap:180328.pvt.01 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:default-token-f2cmq ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/,Port:8080,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:30,TimeoutSeconds:5,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Jul 17 23:45:18 server.raid.local kubelet[20442]: E0717 23:45:18.680001   20442 pod_workers.go:186] Error syncing pod dcc39ce2-89f4-11e8-9786-0a92797fa29e ("json2ldap-666fc85686-tmxrr_raidcloud(dcc39ce2-89f4-11e8-9786-0a92797fa29e)"), skipping: failed to "StartContainer" for "json2ldap" with ImagePullBackOff: "Back-off pulling image \"registry.raidcloud.io/raidcloud/json2ldap:180328.pvt.01\""
Jul 17 23:45:21 server.raid.local kubelet[20442]: I0717 23:45:21.678232   20442 kuberuntime_manager.go:513] Container {Name:coredns Image:k8s.gcr.io/coredns:1.1.3 Command:[] Args:[-conf /etc/coredns/Corefile] WorkingDir: Ports:[{Name:dns HostPort:0 ContainerPort:53 Protocol:UDP HostIP:} {Name:dns-tcp HostPort:0 ContainerPort:53 Protocol:TCP HostIP:} {Name:metrics HostPort:0 ContainerPort:9153 Protocol:TCP HostIP:}] EnvFrom:[] Env:[] Resources:{Limits:map[memory:{i:{value:178257920 scale:0} d:{Dec:<nil>} s:170Mi Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:73400320 scale:0} d:{Dec:<nil>} s:70Mi Format:BinarySI}]} VolumeMounts:[{Name:config-volume ReadOnly:true MountPath:/etc/coredns SubPath: MountPropagation:<nil>} {Name:coredns-token-6nhgg ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/health,Port:8080,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:60,TimeoutSeconds:5,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:5,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[NET_BIND_SERVICE],Drop:[all],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*true,AllowPrivilegeEscalation:*false,RunAsGroup:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Jul 17 23:45:21 server.raid.local kubelet[20442]: I0717 23:45:21.678311   20442 kuberuntime_manager.go:757] checking backoff for container "coredns" in pod "coredns-78fcdf6894-znfvw_kube-system(9b44aa92-89f7-11e8-9786-0a92797fa29e)"
Jul 17 23:45:21 server.raid.local kubelet[20442]: I0717 23:45:21.678404   20442 kuberuntime_manager.go:767] Back-off 5m0s restarting failed container=coredns pod=coredns-78fcdf6894-znfvw_kube-system(9b44aa92-89f7-11e8-9786-0a92797fa29e)
Jul 17 23:45:21 server.raid.local kubelet[20442]: E0717 23:45:21.678425   20442 pod_workers.go:186] Error syncing pod 9b44aa92-89f7-11e8-9786-0a92797fa29e ("coredns-78fcdf6894-znfvw_kube-system(9b44aa92-89f7-11e8-9786-0a92797fa29e)"), skipping: failed to "StartContainer" for "coredns" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=coredns pod=coredns-78fcdf6894-znfvw_kube-system(9b44aa92-89f7-11e8-9786-0a92797fa29e)"
Jul 17 23:45:22 server.raid.local kubelet[20442]: I0717 23:45:22.679145   20442 kuberuntime_manager.go:513] Container {Name:login Image:registry.raidcloud.io/raidcloud/admin:180329.pvt.05 Command:[] Args:[] WorkingDir: Ports:[] EnvFrom:[] Env:[] Resources:{Limits:map[] Requests:map[]} VolumeMounts:[{Name:login-config ReadOnly:true MountPath:/usr/share/nginx/conf/ SubPath: MountPropagation:<nil>} {Name:default-token-f2cmq ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/health,Port:8080,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:5,TimeoutSeconds:5,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Jul 17 23:45:22 server.raid.local kubelet[20442]: E0717 23:45:22.679941   20442 pod_workers.go:186] Error syncing pod dc8392a9-89f4-11e8-9786-0a92797fa29e ("login-85ffb66bb8-5l9fq_raidcloud(dc8392a9-89f4-11e8-9786-0a92797fa29e)"), skipping: failed to "StartContainer" for "login" with ImagePullBackOff: "Back-off pulling image \"registry.raidcloud.io/raidcloud/admin:180329.pvt.05\""
Jul 17 23:45:23 server.raid.local kubelet[20442]: I0717 23:45:23.678172   20442 kuberuntime_manager.go:513] Container {Name:coredns Image:k8s.gcr.io/coredns:1.1.3 Command:[] Args:[-conf /etc/coredns/Corefile] WorkingDir: Ports:[{Name:dns HostPort:0 ContainerPort:53 Protocol:UDP HostIP:} {Name:dns-tcp HostPort:0 ContainerPort:53 Protocol:TCP HostIP:} {Name:metrics HostPort:0 ContainerPort:9153 Protocol:TCP HostIP:}] EnvFrom:[] Env:[] Resources:{Limits:map[memory:{i:{value:178257920 scale:0} d:{Dec:<nil>} s:170Mi Format:BinarySI}] Requests:map[cpu:{i:{value:100 scale:-3} d:{Dec:<nil>} s:100m Format:DecimalSI} memory:{i:{value:73400320 scale:0} d:{Dec:<nil>} s:70Mi Format:BinarySI}]} VolumeMounts:[{Name:config-volume ReadOnly:true MountPath:/etc/coredns SubPath: MountPropagation:<nil>} {Name:coredns-token-6nhgg ReadOnly:true MountPath:/var/run/secrets/kubernetes.io/serviceaccount SubPath: MountPropagation:<nil>}] VolumeDevices:[] LivenessProbe:&Probe{Handler:Handler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/health,Port:8080,Host:,Scheme:HTTP,HTTPHeaders:[],},TCPSocket:nil,},InitialDelaySeconds:60,TimeoutSeconds:5,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:5,} ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy:File ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[NET_BIND_SERVICE],Drop:[all],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:*true,AllowPrivilegeEscalation:*false,RunAsGroup:nil,} Stdin:false StdinOnce:false TTY:false} is dead, but RestartPolicy says that we should restart it.
Jul 17 23:45:23 server.raid.local kubelet[20442]: I0717 23:45:23.678412   20442 kuberuntime_manager.go:757] checking backoff for container "coredns" in pod "coredns-78fcdf6894-lcqt5_kube-system(9b45a068-89f7-11e8-9786-0a92797fa29e)"
Jul 17 23:45:23 server.raid.local kubelet[20442]: I0717 23:45:23.678532   20442 kuberuntime_manager.go:767] Back-off 5m0s restarting failed container=coredns pod=coredns-78fcdf6894-lcqt5_kube-system(9b45a068-89f7-11e8-9786-0a92797fa29e)
Jul 17 23:45:23 server.raid.local kubelet[20442]: E0717 23:45:23.678554   20442 pod_workers.go:186] Error syncing pod 9b45a068-89f7-11e8-9786-0a92797fa29e ("coredns-78fcdf6894-lcqt5_kube-system(9b45a068-89f7-11e8-9786-0a92797fa29e)"), skipping: failed to "StartContainer" for "coredns" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=coredns pod=coredns-78fcdf6894-lcqt5_kube-system(9b45a068-89f7-11e8-9786-0a92797fa29e)"

@chrisohaver

Found a couple instances of the same errors reported in other scenarios in the past.
Might try removing "allowPrivilegeEscalation: false" from the CoreDNS deployment to see if that helps.
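
For reference, one non-interactive way to drop that field is a JSON patch against the deployment. This is only a sketch; it assumes coredns is the first (index 0) container in the deployment:

kubectl -n kube-system patch deployment coredns --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/securityContext/allowPrivilegeEscalation"}]'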

@sectorsize512

Same issue for me. Similar setup: CentOS 7.4.1708, Docker version 1.13.1, build 94f4240/1.13.1 (comes with CentOS):

[root@faas-A01 ~]# kubectl get pods --all-namespaces
NAMESPACE     NAME                                                 READY     STATUS             RESTARTS   AGE
kube-system   calico-node-2vssv                                    2/2       Running            0          9m
kube-system   calico-node-4vr7t                                    2/2       Running            0          7m
kube-system   calico-node-nlfnd                                    2/2       Running            0          17m
kube-system   calico-node-rgw5w                                    2/2       Running            0          23m
kube-system   coredns-78fcdf6894-p4wbl                             0/1       CrashLoopBackOff   9          30m
kube-system   coredns-78fcdf6894-r4pwf                             0/1       CrashLoopBackOff   9          30m
kube-system   etcd-faas-a01.sl.cloud9.ibm.com                      1/1       Running            0          29m
kube-system   kube-apiserver-faas-a01.sl.cloud9.ibm.com            1/1       Running            0          29m
kube-system   kube-controller-manager-faas-a01.sl.cloud9.ibm.com   1/1       Running            0          29m
kube-system   kube-proxy-55csj                                     1/1       Running            0          17m
kube-system   kube-proxy-56r8c                                     1/1       Running            0          30m
kube-system   kube-proxy-kncql                                     1/1       Running            0          9m
kube-system   kube-proxy-mf2bp                                     1/1       Running            0          7m
kube-system   kube-scheduler-faas-a01.sl.cloud9.ibm.com            1/1       Running            0          29m
[root@faas-A01 ~]# kubectl logs --namespace=all coredns-78fcdf6894-p4wbl
Error from server (NotFound): namespaces "all" not found
[root@faas-A01 ~]# kubectl logs --namespace=kube-system coredns-78fcdf6894-p4wbl
standard_init_linux.go:178: exec user process caused "operation not permitted"

@sectorsize512

Just in case: SELinux is in permissive mode on all nodes.

@sectorsize512

And I'm using Calico (not Weave like @carlosmkb).

@chrisohaver

[root@faas-A01 ~]# kubectl logs --namespace=kube-system coredns-78fcdf6894-p4wbl
standard_init_linux.go:178: exec user process caused "operation not permitted"

Ah - This is an error from kubectl when trying to get the logs, not the contents of the logs...

@carlosrmendes
Author

@chrisohaver kubectl logs works with the other kube-system pods.

@chrisohaver

OK - have you tried removing "allowPrivilegeEscalation: false" from the CoreDNS deployment to see if that helps?

@chrisohaver

... does a kubectl describe of the coredns pod show anything interesting?

@Leozki

Leozki commented Jul 26, 2018

Same issue for me.
CentOS Linux release 7.5.1804 (Core)
Docker version 1.13.1, build dded712/1.13.1
flannel as cni add-on

[root@k8s ~]# kubectl get pods --all-namespaces
NAMESPACE     NAME                                 READY     STATUS             RESTARTS   AGE
kube-system   coredns-78fcdf6894-cfmm7             0/1       CrashLoopBackOff   12         15m
kube-system   coredns-78fcdf6894-k65js             0/1       CrashLoopBackOff   11         15m
kube-system   etcd-k8s.master                      1/1       Running            0          14m
kube-system   kube-apiserver-k8s.master            1/1       Running            0          13m
kube-system   kube-controller-manager-k8s.master   1/1       Running            0          14m
kube-system   kube-flannel-ds-fts6v                1/1       Running            0          14m
kube-system   kube-proxy-4tdb5                     1/1       Running            0          15m
kube-system   kube-scheduler-k8s.master            1/1       Running            0          14m
[root@k8s ~]# kubectl logs coredns-78fcdf6894-cfmm7 -n kube-system
standard_init_linux.go:178: exec user process caused "operation not permitted"
[root@k8s ~]# kubectl describe pods coredns-78fcdf6894-cfmm7 -n kube-system
Name:           coredns-78fcdf6894-cfmm7
Namespace:      kube-system
Node:           k8s.master/192.168.150.40
Start Time:     Fri, 27 Jul 2018 00:32:09 +0800
Labels:         k8s-app=kube-dns
                pod-template-hash=3497892450
Annotations:    <none>
Status:         Running
IP:             10.244.0.12
Controlled By:  ReplicaSet/coredns-78fcdf6894
Containers:
  coredns:
    Container ID:  docker://3b7670fbc07084410984d7e3f8c0fa1b6d493a41d2a4e32f5885b7db9d602417
    Image:         k8s.gcr.io/coredns:1.1.3
    Image ID:      docker-pullable://k8s.gcr.io/coredns@sha256:db2bf53126ed1c761d5a41f24a1b82a461c85f736ff6e90542e9522be4757848
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 27 Jul 2018 00:46:30 +0800
      Finished:     Fri, 27 Jul 2018 00:46:30 +0800
    Ready:          False
    Restart Count:  12
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-vqslm (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-vqslm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-vqslm
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                 From                 Message
  ----     ------            ----                ----                 -------
  Warning  FailedScheduling  16m (x6 over 16m)   default-scheduler    0/1 nodes are available: 1 node(s) were not ready.
  Normal   Scheduled         16m                 default-scheduler    Successfully assigned kube-system/coredns-78fcdf6894-cfmm7 to k8s.master
  Warning  BackOff           14m (x10 over 16m)  kubelet, k8s.master  Back-off restarting failed container
  Normal   Pulled            14m (x5 over 16m)   kubelet, k8s.master  Container image "k8s.gcr.io/coredns:1.1.3" already present on machine
  Normal   Created           14m (x5 over 16m)   kubelet, k8s.master  Created container
  Normal   Started           14m (x5 over 16m)   kubelet, k8s.master  Started container
  Normal   Pulled            11m (x4 over 12m)   kubelet, k8s.master  Container image "k8s.gcr.io/coredns:1.1.3" already present on machine
  Normal   Created           11m (x4 over 12m)   kubelet, k8s.master  Created container
  Normal   Started           11m (x4 over 12m)   kubelet, k8s.master  Started container
  Warning  BackOff           2m (x56 over 12m)   kubelet, k8s.master  Back-off restarting failed container
[root@k8s ~]# uname
Linux
[root@k8s ~]# uname -a
Linux k8s.master 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@k8s ~]# cat /etc/redhat-release 
CentOS Linux release 7.5.1804 (Core) 
[root@k8s ~]# docker --version
Docker version 1.13.1, build dded712/1.13.1

@zimnyjakub

zimnyjakub commented Jul 27, 2018

I have the same issue when SELinux is in permissive mode. When I disable it in /etc/selinux/config (SELINUX=disabled) and reboot the machine, the pod starts up.

Redhat 7.4, kernel 3.10.0-693.11.6.el7.x86_64
docker-1.13.1-68.gitdded712.el7.x86_64
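
For anyone automating this, a minimal sketch of the "disable and reboot" step described above (mind the security implications of turning SELinux off entirely):

# sketch: permanently disable SELinux; only takes effect after a reboot
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
reboot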

@chrisohaver

FYI, it also works for me with SELinux disabled (not permissive, but disabled).
Docker version 1.13.1, build dded712/1.13.1
CentOS 7

[root@centosk8s ~]# kubectl logs coredns-78fcdf6894-rhx9p -n kube-system
.:53
CoreDNS-1.1.3
linux/amd64, go1.10.1, b0fd575c
2018/07/27 16:37:31 [INFO] CoreDNS-1.1.3
2018/07/27 16:37:31 [INFO] linux/amd64, go1.10.1, b0fd575c
2018/07/27 16:37:31 [INFO] plugin/reload: Running configuration MD5 = 2a066f12ec80aeb2b92740dd74c17138

@lareeth

lareeth commented Jul 30, 2018

We are also experiencing this issue. We provision infrastructure through automation, so requiring a restart to completely disable SELinux is not acceptable. Are there any other workarounds while we wait for this to be fixed?

@chrisohaver

chrisohaver commented Jul 30, 2018

Try removing "allowPrivilegeEscalation: false" from the CoreDNS deployment to see if that helps.
Updating to a more recent version of docker (than 1.13) may also help.

@thdrnsdk

thdrnsdk commented Aug 1, 2018

Same issue here
Docker version 1.2.6
CentOS 7
Like @lareeth we also provision Kubernetes automatically with kubeadm, so requiring a restart to completely disable SELinux is not acceptable for us either.
@chrisohaver your suggestion is helpful, thank you!
But as far as I know the CoreDNS options cannot be set in the kubeadm configuration.
Is there no other way?

@chrisohaver

chrisohaver commented Aug 1, 2018

Try removing "allowPrivilegeEscalation: false" from the CoreDNS deployment to see if that helps.
Updating to a more recent version of docker (e.g. to the version recommended by k8s) may also help.

@chrisohaver

chrisohaver commented Aug 1, 2018

I verified that removing "allowPrivilegeEscalation: false" from the coredns deployment resolves the issue (with SELinux enabled in permissive mode).

@chrisohaver

chrisohaver commented Aug 1, 2018

I also verified that upgrading to a version of docker recommended by Kubernetes (docker 17.03) resolves the issue, with "allowPrivilegeEscalation: false" left in place in the coredns deployment, and SELinux enabled in permissive mode.

@chrisohaver

chrisohaver commented Aug 1, 2018

So, it appears there is an incompatibility between old versions of Docker and SELinux with the allowPrivilegeEscalation directive, which has apparently been resolved in later versions of Docker.

There appear to be three different workarounds:

  • Upgrade to newer version of docker, e.g. 17.03, the version currently recommended by k8s
  • Or remove allowPrivilegeEscalation=false from the deployment's pod spec
  • Or disable SELinux
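
For the first workaround, a rough sketch of the Docker upgrade on CentOS 7. Package names and version strings are illustrative; check yum list docker-ce --showduplicates for what is actually available:

yum install -y yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
yum install -y --setopt=obsoletes=0 docker-ce-17.03.2.ce-1.el7.centos docker-ce-selinux-17.03.2.ce-1.el7.centos
systemctl enable --now docker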

@Leozki

Leozki commented Aug 1, 2018

@chrisohaver I have resolved the issue by upgrading to the newer Docker version 17.03. Thanks.

@neolit123
Member

thanks for the investigation @chrisohaver 💯

@kuznero

kuznero commented Aug 10, 2018

Thanks, @chrisohaver !

This worked:

kubectl -n kube-system get deployment coredns -o yaml | \
  sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' | \
  kubectl apply -f -

@neolit123
Member

neolit123 commented Aug 10, 2018

@chrisohaver
do you think we should document this step in the kubeadm troubleshooting guide for SELinux nodes in the lines of:


coredns pods have CrashLoopBackOff or Error state

If you have nodes that are running SELinux with an older version of Docker you might experience a scenario where the coredns pods are not starting. To solve that you can try one of the following options:

  • Upgrade to a newer version of Docker - 17.03 is confirmed to work.
  • Disable SELinux.
  • Modify the coredns deployment to set allowPrivilegeEscalation to true:
kubectl -n kube-system get deployment coredns -o yaml | \
  sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' | \
  kubectl apply -f -

WDYT? please suggest amends to the text if you think something can be improved.

@chrisohaver

That's fine. We should perhaps mention that there are negative security implications when disabling SELinux or changing the allowPrivilegeEscalation setting.

The most secure solution is to upgrade Docker to the version that Kubernetes recommends (17.03)

@neolit123
Member

@chrisohaver
understood, will amend the copy and submit a PR for this.

@neolit123 neolit123 self-assigned this Aug 10, 2018
@neolit123 neolit123 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. kind/documentation Categorizes issue or PR as related to documentation. and removed priority/needs-more-evidence labels Aug 10, 2018
@neolit123 neolit123 added this to the v1.12 milestone Aug 10, 2018
@chuckha chuckha added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Aug 15, 2018
@mydockergit

mydockergit commented Jan 23, 2019

There is also an answer for this on Stack Overflow:
https://stackoverflow.com/questions/53075796/coredns-pods-have-crashloopbackoff-or-error-state

This error

[FATAL] plugin/loop: Seen "HINFO IN 6900627972087569316.7905576541070882081." more than twice, loop detected

is caused when CoreDNS detects a loop in the DNS resolution configuration, and it is the intended behavior. You are hitting this issue:

#1162

coredns/coredns#2087

Hacky solution: Disable the CoreDNS loop detection

Edit the CoreDNS configmap:

kubectl -n kube-system edit configmap coredns

Remove or comment out the line with loop, save and exit.

Then remove the CoreDNS pods, so new ones can be created with new config:

kubectl -n kube-system delete pod -l k8s-app=kube-dns

All should be fine after that.
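
If you prefer not to edit the ConfigMap interactively, a non-interactive sketch of the same two steps (it assumes the default kubeadm-generated ConfigMap, where loop sits on its own line in the Corefile):

kubectl -n kube-system get configmap coredns -o yaml \
  | sed '/^\s*loop\s*$/d' \
  | kubectl apply -f -
kubectl -n kube-system delete pod -l k8s-app=kube-dns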

Preferred Solution: Remove the loop in the DNS configuration

First, check if you are using systemd-resolved. If you are running Ubuntu 18.04, it is probably the case.

systemctl list-unit-files | grep enabled | grep systemd-resolved

If it is, check which resolv.conf file your cluster is using as reference:

ps auxww | grep kubelet

You might see a line like:

/usr/bin/kubelet ... --resolv-conf=/run/systemd/resolve/resolv.conf

The important part is --resolv-conf: it tells us whether the systemd-managed resolv.conf is being used or not.

If it is the resolv.conf of systemd, do the following:

Check the content of /run/systemd/resolve/resolv.conf to see if there is a record like:

nameserver 127.0.0.1

If there is 127.0.0.1, it is the one causing the loop.

To get rid of it, do not edit that file directly; instead, check the other places from which it is generated.

Check all files under /etc/systemd/network and if you find a record like

DNS=127.0.0.1

delete that record. Also check /etc/systemd/resolved.conf and do the same if needed. Make sure you have at least one or two DNS servers configured, such as

DNS=1.1.1.1 1.0.0.1
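
A quick way to find any such records (a sketch; the paths are the defaults mentioned above):

grep -rn '^DNS=' /etc/systemd/network/ /etc/systemd/resolved.conf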

After doing all that, restart the systemd services to put your changes into effect:
systemctl restart systemd-networkd systemd-resolved

After that, verify that DNS=127.0.0.1 is no more in the resolv.conf file:

cat /run/systemd/resolve/resolv.conf

Finally, trigger re-creation of the DNS pods:

kubectl -n kube-system delete pod -l k8s-app=kube-dns

Summary: The solution involves getting rid of what looks like a DNS lookup loop from the host DNS configuration. Steps vary between different resolv.conf managers/implementations.

@chrisohaver

Thanks. It's also covered in the CoreDNS loop plugin readme ...

@mengxifl

mengxifl commented Mar 20, 2019

I have the same problem, plus another one.

1. It seems DNS cannot be resolved. The errors are:

[ERROR] plugin/errors: 2 2115717704248378980.1120568170924441806. HINFO: unreachable backend: read udp 10.224.0.3:57088->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 2115717704248378980.1120568170924441806. HINFO: unreachable backend: read udp 10.224.0.3:38819->172.16.254.1:53: i/o timeout
........

My /etc/resolv.conf only has:

nameserver 172.16.254.1 #this is my dns
nameserver 8.8.8.8 #another dns on the internet

I ran:

kubectl -n kube-system get deployment coredns -o yaml | \
  sed 's/allowPrivilegeEscalation: false/allowPrivilegeEscalation: true/g' | \
  kubectl apply -f -

After the pods were rebuilt there is only one error left:

[ERROR] plugin/errors: 2 10594135170717325.8545646296733374240. HINFO: unreachable backend: no upstream host

I don't know if that's normal. Maybe.

2. CoreDNS cannot reach my API service. The error is:

kube-dns Failed to list *v1.Endpoints getsockopt: 10.96.0.1:6443 api connection refused

CoreDNS restarts again and again and eventually ends up in CrashLoopBackOff.

So I had to pin CoreDNS to the master node, which I did with:

kubectl edit deployment/coredns --namespace=kube-system
# under spec.template.spec:
  nodeSelector:
    node-role.kubernetes.io/master: ""

I don't know if that's normal.

Finally, here is my environment:

Linux 4.20.10-1.el7.elrepo.x86_64 /// centos 7

docker Version: 18.09.3

[root@k8smaster00 ~]# docker image ls -a
REPOSITORY TAG IMAGE ID CREATED SIZE
k8s.gcr.io/kube-controller-manager v1.13.3 0482f6400933 6 weeks ago 146MB
k8s.gcr.io/kube-proxy v1.13.3 98db19758ad4 6 weeks ago 80.3MB
k8s.gcr.io/kube-apiserver v1.13.3 fe242e556a99 6 weeks ago 181MB
k8s.gcr.io/kube-scheduler v1.13.3 3a6f709e97a0 6 weeks ago 79.6MB
quay.io/coreos/flannel v0.11.0-amd64 ff281650a721 7 weeks ago 52.6MB
k8s.gcr.io/coredns 1.2.6 f59dcacceff4 4 months ago 40MB
k8s.gcr.io/etcd 3.2.24 3cab8e1b9802 6 months ago 220MB
k8s.gcr.io/pause 3.1 da86e6ba6ca1 15 months ago 742kB

Kubernetes is 1.13.3

I think this is a bug. I expect an official update or a solution.

@chrisohaver

I have same problem ...

@mengxifl, those errors are significantly different from the ones reported and discussed in this issue.

[ERROR] plugin/errors: 2 2115717704248378980.1120568170924441806. HINFO: unreachable backend: read udp 10.224.0.3:57088->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 2115717704248378980.1120568170924441806. HINFO: unreachable backend: read udp 10.224.0.3:38819->172.16.254.1:53: i/o timeout

Those errors mean that the CoreDNS pod (and probably all other pods) cannot reach your nameservers. This suggests a networking problem in your cluster to the outside world. Possibly flannel misconfiguration or firewalls.

the coredns cannot found my api service ...
so i have to run coredns on master node

This is also not normal. If I understand you correctly, you are saying that CoreDNS can contact the API from the master node but not other nodes. This would suggest pod to service networking problems between nodes within your cluster - perhaps an issue with flannel configuration or firewalls.
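
A quick way to confirm that from inside the cluster is to run lookups from a throwaway pod (a sketch; busybox:1.28 is used because its nslookup is known to behave):

kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- nslookup bing.com 8.8.8.8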

@mengxifl

mengxifl commented Mar 21, 2019


Thank you for your reply

Maybe I should post my YAML files.

I use:
kubeadm init --config=config.yaml

My config.yaml content is:

apiVersion: kubeadm.k8s.io/v1alpha3
kind: InitConfiguration
apiEndpoint:
  advertiseAddress: "172.16.254.74"
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1alpha3
kind: ClusterConfiguration
kubernetesVersion: "v1.13.3"
etcd:
  external:
    endpoints:
    - "https://172.16.254.86:2379" 
    - "https://172.16.254.87:2379"
    - "https://172.16.254.88:2379"
    caFile: /etc/kubernetes/pki/etcd/ca.pem
    certFile: /etc/kubernetes/pki/etcd/client.pem
    keyFile: /etc/kubernetes/pki/etcd/client-key.pem
networking:
  podSubnet: "10.224.0.0/16"
  serviceSubnet: "10.96.0.0/12"
apiServerCertSANs:
- k8smaster00
- k8smaster01
- k8snode00
- k8snode01
- 172.16.254.74
- 172.16.254.79
- 172.16.254.80
- 172.16.254.81
- 172.16.254.85 #Vip
- 127.0.0.1
clusterName: "cluster"
controlPlaneEndpoint: "172.16.254.85:6443"

apiServerExtraArgs:
  service-node-port-range: 20-65535

My flannel YAML is the default:

https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

systemctl status firewalld
All nodes say:
Unit firewalld.service could not be found.

cat /etc/sysconfig/iptables
All nodes say:
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -p tcp -m tcp --dport 1:65535 -j ACCEPT
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT
-A OUTPUT -p tcp -m tcp --sport 1:65535 -j ACCEPT
-A FORWARD -p tcp -m tcp --dport 1:65535 -j ACCEPT
-A FORWARD -p tcp -m tcp --sport 1:65535 -j ACCEPT
COMMIT

cat /etc/resolv.conf & ping bing.com
All nodes say:
[1] 6330
nameserver 172.16.254.1
nameserver 8.8.8.8
PING bing.com (13.107.21.200) 56(84) bytes of data.
64 bytes from 13.107.21.200 (13.107.21.200): icmp_seq=2 ttl=111 time=149 ms

uname -rs
The master node says:
Linux 4.20.10-1.el7.elrepo.x86_64

uname -rs
The slave nodes say:
Linux 4.4.176-1.el7.elrepo.x86_64

So I don't think the firewall is the issue; maybe flannel? But I use the default config. Or maybe the Linux kernel version? I don't know.

OK, I ran
/sbin/iptables -t nat -I POSTROUTING -s 10.224.0.0/16 -j MASQUERADE

on all my nodes and that worked for me. Thanks.
