
Random task failures with "non-zero exit (137)" since 18.09.1 #38768

Closed
lwimmer opened this issue Feb 21, 2019 · 28 comments


@lwimmer

lwimmer commented Feb 21, 2019

Description

Since the upgrade from 18.09.0 to 18.09.1 (same with 18.09.2) we experience random task failures.

The tasks die with non-zero exit (137) error messages, which indicate a SIGKILL being received.
A common reason is an OOM kill, but this is not the case for our containers.
We monitor the memory usage, and the affected containers are well below all limits (per container, and the host also has enough free memory).
There is also no kernel OOM stack trace in the logs, and docker inspect on the dead containers shows "OOMKilled": false.
When we forcefully provoke an OOM kill, we do see the expected stack trace, and in that case the OOMKilled flag is set to true.
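For reference, the flag can be checked with something along these lines (the container ID is a placeholder and the output is illustrative):

$ docker inspect --format '{{.State.OOMKilled}} / exit code {{.State.ExitCode}}' <container-id>
false / exit code 137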

The containers are not supposed to shut down, and the health checks are not the culprit either.
We experience this with practically all our containers, which are very different.
It happens, for example, with the official nginx image only serving static files, so we don't suspect our own containers,
especially since we don't experience this issue with the very same images on 18.09.0.

So the question is who is killing our containers?

We managed to capture the culprit sending the SIGKILL with auditd.
Here is the relevant ausearch output:

----
type=PROCTITLE msg=audit(02/21/19 00:22:56.549:883) : proctitle=runc --root /var/run/docker/runtime-runc/moby --log-format json delete --force ad6e6f45bec2a3201313b02af4f038830aaad7ab052548608
type=OBJ_PID msg=audit(02/21/19 00:22:56.549:883) : opid=2488 oauid=unset ouid=root oses=-1 ocomm=wrapper-linux-x
type=SYSCALL msg=audit(02/21/19 00:22:56.549:883) : arch=x86_64 syscall=kill success=yes exit=0 a0=0x9b8 a1=SIGKILL a2=0x0 a3=0x0 items=0 ppid=795 pid=27127 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=runc exe=/usr/sbin/runc key=kill_signals

So it seems runc is for some reason killing the container. But why?

Steps to reproduce the issue:

  1. Set up a Docker swarm on 18.09.2 with a lot of running containers (see the sketch after this list) and wait a few days.
  2. Experience random task failures.
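
A minimal sketch for such a soak test, assuming the official nginx image (service name and replica count are arbitrary):

$ docker service create --name soak-nginx --replicas 50 nginx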

Describe the results you received:
Tasks randomly failing:

dockerd[794]: time="2019-02-21T00:22:56.945715049Z" level=error msg="fatal task error" error="task: non-zero exit (137)" module=node/agent/taskmanager node.id=3whne2nbw3atm30sxp2w9elhx service.id=jjt0smq927b91b4s68lp9lquj task.id=p0kmz6uqsp7izlfkrr25534sc

Describe the results you expected:
Tasks NOT randomly failing.

We have downgraded our production environment to 18.09.0 again and have not experienced the failure in the last few weeks; everything else stayed the same (images, configuration, kernel, etc.).
So it is definitely a problem introduced in 18.09.1.

Additional information you deem important (e.g. issue happens only occasionally):
We have a Docker swarm running with around 25 nodes, 35 services and 100 containers.

The issue happens around 2-5 times a day with completely different containers on all of the nodes.
Since the upgrade we have not had a single day without at least 2 kills, and we have had up to 6 kills on a single day.

Every container seems to be equally likely to be affected.

Output of docker version:

Client:
 Version:           18.09.2
 API version:       1.39
 Go version:        go1.10.6
 Git commit:        6247962
 Built:             Sun Feb 10 04:13:47 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Sun Feb 10 03:42:13 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 30
 Running: 17
 Paused: 0
 Stopped: 13
Images: 29
Server Version: 18.09.2
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: efs local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: 3whne2nbw3atm30sxp2w9elhx
 Is Manager: false
 Node Address: 10.1.148.157
 Manager Addresses:
  10.1.157.55:2377
  10.1.17.3:2377
  10.1.90.128:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 09c8266bf2fcf9519a651b04ae54c967b9ab86ec
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-1032-aws
Operating System: Ubuntu 18.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.691GiB
Name: frontend0.dev.ecosio.internal
ID: LOBR:KA6Q:CHBM:OB3V:Q4PS:MFI7:ARBE:XQRO:35PO:UFED:SLHZ:UTTE
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
 10.0.0.0/10
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):
On AWS/EC2.

@bill-warshaw

We're seeing an identical issue with 18.09.2 (6247962).

There's also a ticket on the Docker forums which seems to reference the same bug: https://forums.docker.com/t/container-fails-with-error-137-but-no-oom-flag-set-and-theres-plenty-of-ram/69336/5

@thaJeztah thaJeztah added this to backlog in maintainers-session Mar 6, 2019
@silpamittapalli

We are experiencing the same issue on v18.09.2 and it's evident across multiple sites daily.

Let us know if there is any additional information we can provide to track the issue down.

@thaJeztah
Member

Is this only for 18.09.2, or also with 18.09.1 (more accurately: the version of containerd and runc shipping with 18.09.2)? There was a fix in runc for a CVE that causes more memory to be used when starting a container.

@cpuguy83
Member

cpuguy83 commented Mar 7, 2019

Do you have memory constraints set on these containers, by chance?

@kolyshkin
Contributor

runc kills the container because it is invoked with delete --force ID arguments. The question is who called runc.

@lwimmer is there anything unusual in dockerd logs about this container?
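
On a systemd host, the dockerd log lines around a failure can be pulled with something like the following (the time window and container ID are placeholders):

$ journalctl -u docker.service --since "2019-02-21 00:20:00" --until "2019-02-21 00:25:00" | grep <container-id>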

@kolyshkin
Contributor

This is definitely not an OOM kill, but a forced removal of the container (the equivalent of docker rm -f ID or runc delete --force ID). The question is why it happens.

@lwimmer
Author

lwimmer commented Mar 8, 2019

It definitely happens with v18.09.1 and v18.09.2.
v18.09.0 is definitely not affected.
We updated a testing environment to v18.09.3 yesterday and so far it has not happened, but it's a bit early to celebrate; maybe it's just less likely.

Memory constraints are set on the containers, but the memory usage is far below the limit.

There is nothing else in the logs. It basically starts with the log line I've posted above. If it happens again with v18.09.3, I will add a more thorough log.

@lwimmer
Author

lwimmer commented Mar 8, 2019

The unaffected runc and containerd versions (18.09.0):

containerd version: c4446665cb9c30056f4998ed953e6d4ff22c7c39
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871

Definitely affected (18.09.1):

containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 96ec2177ae841256168fcf76954f7177af9446eb

and (18.09.2):

containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 09c8266bf2fcf9519a651b04ae54c967b9ab86ec

Probably ok (18.09.3):

containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
runc version: 6635b4f0c6af3810594d2770f662f34ddc15b40d

@lwimmer
Author

lwimmer commented Mar 14, 2019

Good news, I think: It seems that 18.09.3 solves that issue.
We have had 18.09.3 running for a week now in our test environment and we haven't experienced a single task failure as described above.

@thaJeztah
Member

@lwimmer thanks! That's good to hear.

For reference, posting the diffs for containerd and runc between 18.09.2 and 18.09.3;

@DuncanMcArdle

DuncanMcArdle commented Mar 18, 2019

Poster of the Docker forums ticket here. After experiencing the issues almost daily around 2 weeks ago, I have now gone over a week without a supposed OOM kill, despite making no changes whatsoever. @lwimmer I'd be mindful of this when considering it solved in 18.09.3, just in case it's in a lull...

@sph3rex

sph3rex commented Mar 21, 2019

The problem is still present in 18.09.3, but only with containers created with "service create". I can confirm that the issue does not exist in 18.09.0.

@nirmaldavis

nirmaldavis commented May 6, 2019

What was the containerd version when this issue happened?
Does it still happen after upgrading containerd to a version above 1.2.3?

It looks like containerd 1.2.2 has a related issue: containerd/containerd#2969.
Does this issue still happen after upgrading containerd along with Docker?
It looks like containerd 1.2.3 and above have race-condition fixes related to this: containerd/containerd#2970.

I also see a related request on the Docker side: docker/for-linux#586.
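
For anyone checking, the containerd version on a node can be read with commands along these lines (the socket path is the one used elsewhere in this thread):

$ containerd --version
$ ctr --address /var/run/docker/containerd/containerd.sock version
$ docker info | grep -i containerd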

@Ghostbaby

@lwimmer thanks!
We also encountered the same problem. Is it possible to provide the audit.rules configuration file?

@lwimmer
Author

lwimmer commented May 24, 2019

I believe we used the following audit.rules:

-a exit,always -F arch=b64 -S kill -F a1=9 -k kill_signals
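
To load this rule at runtime and query the matching kill events, something along these lines should work (the key name matches the rule above):

$ auditctl -a exit,always -F arch=b64 -S kill -F a1=9 -k kill_signals
$ ausearch -k kill_signals -i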

We have not experienced the issue anymore with 18.09.3 - 18.09.6. We only had it with 18.09.1 and 18.09.2.

@Ghostbaby

@lwimmer Thank you very much for your configuration file.
After updating my containerd version to 1.2.5, this issue looks fixed.

@thaJeztah
Member

I'll go ahead and close this issue as it looks to be resolved.

@guunergooner

@lwimmer @Ghostbaby @thaJeztah
Our production environment also hit containers exiting unexpectedly with code 137 on version 18.09.2, and we captured the logs of such an abnormal container exit.

The essence of the problem is that the containerd exec code dereferenced a nil pointer, which caused the containerd-shim process to exit abnormally and triggered a cleanup.

A similar issue was reported on containerd: containerd/containerd#2969.

It has been fixed by containerd/containerd#2970.

So containerd 1.2.3 and later versions should not have this 137 container exit issue.

  • docker version
root@gpu-ser370:~$ docker version
Client:
Version:           18.09.2
API version:       1.39
Go version:        go1.10.6
Git commit:        6247962
Built:             Sun Feb 10 04:13:27 2019
OS/Arch:           linux/amd64
Experimental:      false

Server: Docker Engine - Community
Engine:
Version:          18.09.2
API version:      1.39 (minimum version 1.12)
Go version:       go1.10.6
Git commit:       6247962
Built:            Sun Feb 10 03:47:25 2019
OS/Arch:          linux/amd64
Experimental:     false
  • containerd version
root@gpu-ser370:~$ ctr --address /var/run/docker/containerd/containerd.sock version
Client:
Version:  1.2.2
Revision: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
Server:
Version:  1.2.2
Revision: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
root@gpu-ser370:~$ containerd -v
containerd github.com/containerd/containerd 1.2.2 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
  • runc version
root@localhost:~$ runc -v
runc version 1.0.0-rc6+dev
commit: 09c8266bf2fcf9519a651b04ae54c967b9ab86ec
spec: 1.0.1-dev
root@gpu-ser370:~$ nvidia-container-runtime -v
runc version 1.0.0-rc6+dev
commit: 6635b4f0c6af3810594d2770f662f34ddc15b40d-dirty
spec: 1.0.1-dev
  • panic log
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.374422177+08:00" level=debug msg="Calling POST /v1.39/containers/de42a6372c1d996efc475d294afc42f430beaad3ff443a86f124754a70a1b517/exec"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.374608247+08:00" level=debug msg="form data: {\"AttachStderr\":true,\"AttachStdin\":true,\"AttachStdout\":true,\"Cmd\":[\"ip\",\"a\"],\"Detach\":false,\"DetachKeys\":\"\",\"Env\":null,\"Privileged\":false,\"Tty\":false,\"User\":\"\",\"WorkingDir\":\"\"}"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.376137747+08:00" level=debug msg="Calling POST /v1.39/exec/0f80962a9b908b89ea5af2f20f361e9dc10cfee6986cdc29d59e040e863427ef/start"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.376236749+08:00" level=debug msg="form data: {\"Detach\":false,\"Tty\":false}"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.376409211+08:00" level=debug msg="starting exec command 0f80962a9b908b89ea5af2f20f361e9dc10cfee6986cdc29d59e040e863427ef in container de42a6372c1d996efc475d294afc42f430beaad3ff443a86f124754a70a1b517"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.377794937+08:00" level=debug msg="attach: stdin: begin"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.377888031+08:00" level=debug msg="attach: stdout: begin"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.377945302+08:00" level=debug msg="attach: stderr: begin"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.378383214+08:00" level=debug msg="Closing buffered stdin pipe"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.378459556+08:00" level=debug msg="attach: stdin: end"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.521401882+08:00" level=debug msg="event published" ns=moby topic="/tasks/exec-started" type=containerd.events.TaskExecStarted
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.522615348+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exec-started
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.742957161+08:00" level=debug msg="event published" ns=moby topic="/tasks/exit" type=containerd.events.TaskExit
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.744535201+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.745009232+08:00" level=debug msg="attach: stderr: end"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.745069071+08:00" level=debug msg="attach: stdout: end"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.745086517+08:00" level=debug msg="attach done"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.749102092+08:00" level=debug msg="Calling GET /v1.39/exec/0f80962a9b908b89ea5af2f20f361e9dc10cfee6986cdc29d59e040e863427ef/json"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.793241263+08:00" level=debug msg="Calling GET /v1.38/containers/json?all=1&filters=%7B%22label%22%3A%7B%22io.kubernetes.docker.type%3Dpodsandbox%22%3Atrue%7D%7D&limit=0"
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.803366670+08:00" level=debug msg="Calling GET /v1.38/containers/json?all=1&filters=%7B%22label%22%3A%7B%22io.kubernetes.docker.type%3Dcontainer%22%3Atrue%7D%7D&limit=0"
Feb 25 22:07:13 gpu-ser370 systemd: Removed slice User Slice of root.
Feb 25 22:07:13 gpu-ser370 dockerd: panic: runtime error: invalid memory address or nil pointer dereference
Feb 25 22:07:13 gpu-ser370 dockerd: [signal SIGSEGV: segmentation violation code=0x1 addr=0x78 pc=0x5ff488]
Feb 25 22:07:13 gpu-ser370 dockerd: goroutine 5 [running]:
Feb 25 22:07:13 gpu-ser370 dockerd: github.com/containerd/containerd/runtime/v1/linux/proc.(*execProcess).pidv(...)
Feb 25 22:07:13 gpu-ser370 dockerd: /go/src/github.com/containerd/containerd/runtime/v1/linux/proc/exec.go:76
Feb 25 22:07:13 gpu-ser370 dockerd: github.com/containerd/containerd/runtime/v1/linux/proc.(*execStoppedState).Pid(0x8d8cf0, 0xffffffffffffffff)
Feb 25 22:07:13 gpu-ser370 dockerd: /go/src/github.com/containerd/containerd/runtime/v1/linux/proc/exec_state.go:175 +0x8
Feb 25 22:07:13 gpu-ser370 dockerd: github.com/containerd/containerd/runtime/v1/linux/proc.(*execProcess).Pid(0xc4201e0000, 0x55ec)
Feb 25 22:07:13 gpu-ser370 dockerd: /go/src/github.com/containerd/containerd/runtime/v1/linux/proc/exec.go:72 +0x34
Feb 25 22:07:13 gpu-ser370 dockerd: github.com/containerd/containerd/runtime/v1/shim.(*Service).checkProcesses(0xc42000e0c0, 0xbf8d68647291f929, 0xbec8633ef91e, 0x8bc720, 0x4308, 0x0)
Feb 25 22:07:13 gpu-ser370 dockerd: /go/src/github.com/containerd/containerd/runtime/v1/shim/service.go:514 +0xde
Feb 25 22:07:13 gpu-ser370 dockerd: github.com/containerd/containerd/runtime/v1/shim.(*Service).processExits(0xc42000e0c0)
Feb 25 22:07:13 gpu-ser370 dockerd: /go/src/github.com/containerd/containerd/runtime/v1/shim/service.go:492 +0xd0
Feb 25 22:07:13 gpu-ser370 dockerd: created by github.com/containerd/containerd/runtime/v1/shim.NewService
Feb 25 22:07:13 gpu-ser370 dockerd: /go/src/github.com/containerd/containerd/runtime/v1/shim/service.go:91 +0x3e9
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.874199295+08:00" level=info msg="shim reaped" id=de42a6372c1d996efc475d294afc42f430beaad3ff443a86f124754a70a1b517
Feb 25 22:07:13 gpu-ser370 dockerd: time="2020-02-25T22:07:13.874266470+08:00" level=warning msg="cleaning up after killed shim" id=de42a6372c1d996efc475d294afc42f430beaad3ff443a86f124754a70a1b517 namespace=moby
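
To check whether a node is still on an affected containerd (earlier than 1.2.3) and whether this shim panic preceded the 137 exits, something like the following can be used (the log path depends on the distribution, e.g. /var/log/syslog or /var/log/messages):

$ containerd --version
$ grep -E 'cleaning up after killed shim|invalid memory address' /var/log/syslog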

@straurob

straurob commented Aug 4, 2020

We're experiencing the described issue on Docker 19.03.2 and containerd 1.2.6. From what we can observe, the containers are not killed due to OOM. We're running a Docker Swarm; the containers/tasks are created by docker service create.

$ docker version
Client: Docker Engine - Community
 Version:           19.03.2
 API version:       1.40
 Go version:        go1.12.8
 Git commit:        6a30dfca03
 Built:             Thu Aug 29 05:29:49 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.8
  Git commit:       6a30dfca03
  Built:            Thu Aug 29 05:28:23 2019
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
$ ctr --address /var/run/docker/containerd/containerd.sock version
Client:
  Version:  1.2.6
  Revision: 894b81a4b802e4eb2a91d1ce216b8817763c29fb

@thaJeztah
Member

@straurob docker 19.03.2 and containerd 1.2.6 are quite some patch releases behind the latest; if you have a test setup where you can test/reproduce, are you still seeing this on the latest patch releases for both? (not sure if anything in this area changed, but would be useful to know if it's still an issue or if it has been fixed since)

@straurob

straurob commented Aug 4, 2020

@thaJeztah Thanks for the information. We'll check out the latest patch releases. Unfortunately we have only observed this behavior in our production environment so far, but our other stages run different versions, which might explain that. I'll post more information here as soon as I have some.

@thaJeztah
Member

Thanks! That's appreciated.

@straurob

straurob commented Sep 7, 2020

@thaJeztah We upgraded all our swarm nodes to Docker 19.03.12. The nodes have been running for a week since then and as far as I can tell updating seems to have resolved the issue. At least I couldn't observe any of these random restarts anymore.

Server: Docker Engine - Community
 Engine:
  Version:          19.03.12
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       48a66213fe
  Built:            Mon Jun 22 15:44:23 2020
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

@thaJeztah
Member

Perfect! Thanks for the update, @straurob 👍

@devopscloudsre

We are on the versions below and we have this issue. Any help? Maybe we need to reopen this thread.

Init Binary: docker-init
containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd

[root@ip-172-20-0-13 compose-stacks]# docker --version
Docker version 19.03.13, build 4484c46d9d
[root@ip-172-20-0-13 compose-stacks]#

[root@ip-172-20-0-13 compose-stacks]# docker info
Client:
Debug Mode: false

Server:
Containers: 5
Running: 5
Paused: 0
Stopped: 0
Images: 6
Server Version: 19.03.4

@thaJeztah
Member

@mailrahulsre your daemon is running Server Version: 19.03.4; try whether updating your engine and containerd.io to a current version helps. The current version of the engine is 19.03.13, and of containerd.io it is 1.3.7.
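
On a yum-based host with Docker's repositories configured, the upgrade would look roughly like this (adjust accordingly for apt-based systems):

$ yum list installed 'docker*' 'containerd*'
$ yum update docker-ce docker-ce-cli containerd.io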

@koushr

koushr commented Oct 21, 2022

capture the culprit sending the SIGKILL with auditd.

How do you capture the culprit sending the SIGKILL with auditd? Which file or directory do you pass to auditctl?

@lwimmer
Author

lwimmer commented Oct 23, 2022

How do you capture the culprit sending the SIGKILL with auditd? Which file or directory do you pass to auditctl?

I've posted the audit rules in this comment: #38768 (comment)
