Random task failures with "non-zero exit (137)" since 18.09.1 #38768
Comments
We're seeing an identical issue. There's also a ticket on the Docker forums which seems to reference the same bug: https://forums.docker.com/t/container-fails-with-error-137-but-no-oom-flag-set-and-theres-plenty-of-ram/69336/5
We are experiencing the same issue on v18.09.2 and it's evident across multiple sites daily. Let us know if there is any additional information we can provide to track the issue down.
Is this only for 18.09.2, or also with 18.09.1 (more accurately, the version of containerd and runc shipping with 18.09.2)? There was a fix in runc for a CVE that causes more memory to be used when starting a container.
Do you have memory constraints set on these containers, by chance?
runc kills the container as it is asked to. @lwimmer is there anything unusual in the dockerd logs about this container?
This is definitely not an OOM kill, but a force removal of the container (an equivalent of a forced `docker rm`).
It definitely happens with v18.09.1 and v18.09.2. Memory constraints are set on the containers, but the memory usage is far below the limit. There is nothing else in the logs. It basically starts with the log line I've posted above. If it happens again with v18.09.3, I will add a more thorough log.
The runc and containerd versions that are not affected (18.09.0):
Definitely affected (18.09.1):
and (18.09.2):
Probably OK (18.09.3):
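For anyone trying to correlate their own setup with the versions above, a minimal sketch of how to read the containerd and runc versions an engine ships with (assuming the binaries are on the PATH; the field names in `docker info` vary slightly between releases):

```bash
# The engine reports the containerd and runc commits it was built against.
docker info 2>/dev/null | grep -iE 'containerd version|runc version'

# The standalone binaries report their own versions, if present on the PATH.
containerd --version
runc --version
```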
Good news, I think: it seems that 18.09.3 solves the issue.
@lwimmer thanks! That's good to hear. For reference, posting the diffs for containerd and runc between 18.09.2 and 18.09.3:
Poster of the Docker forums ticket here. After experiencing the issue almost daily until around 2 weeks ago, I've now gone over a week without a supposed OOM kill, despite making no changes whatsoever. @lwimmer I'd be mindful of this when considering it solved in 18.09.3, just in case it's just a lull...
The problem is still present in 18.09.3, but only with containers created with `service create`. I can confirm that the issue does not exist in 18.09.0.
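For context, a minimal sketch of the kind of swarm setup where this was reported (the service name, image, and memory limit here are hypothetical, not taken from the report):

```bash
# Create a simple swarm service with a memory limit, then watch its tasks;
# the reported failures show up as tasks exiting with code 137 even though
# memory usage stays well below the limit and OOMKilled is false.
docker service create --name web --replicas 3 --limit-memory 256m nginx:alpine
docker service ps web --no-trunc
```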
What was the containerd version when this issue happened? It looks like containerd 1.2.2 has a related issue: containerd/containerd#2969. I also see a related request in docker/for-linux#586.
@lwimmer thanks!
I believe we used the following configuration:
We have not experienced the issue anymore with 18.09.3 - 18.09.6. We only had it with 18.09.1 and 18.09.2.
@lwimmer Thank you very much for your configuration file.
I'll go ahead and close this issue as it looks to be resolved.
@lwimmer @Ghostbaby @thaJeztah The essence of the problem is that the containerd exec code dereferenced a null pointer, which caused the containerd-shim process to exit abnormally and trigger a cleanup. See the similar containerd issue #2969, which was fixed by #2970. So containerd 1.2.3 and later versions will not have these 137 container exits.
We're experiencing the described issue on Docker 19.03.2 and containerd 1.2.6. From what we can observe, the containers are not killed due to OOM. We're running a Docker Swarm; the containers/tasks are created as swarm services.
@straurob docker 19.03.2 and containerd 1.2.6 are quite a few patch releases behind the latest; if you have a test setup where you can test/reproduce, are you still seeing this on the latest patch releases for both? (Not sure if anything in this area changed, but it would be useful to know whether it's still an issue or whether it has been fixed since.)
@thaJeztah Thanks for the information. We'll check out the latest patch releases. Unfortunately we have only observed this behavior in our production environment so far, but our staging environments are running on different versions, so that might explain it. I'll post more information here as soon as I have some.
Thanks! That's appreciated.
@thaJeztah We upgraded all our swarm nodes to Docker 19.03.12. The nodes have been running for a week since then and, as far as I can tell, updating seems to have resolved the issue. At least I couldn't observe any of these random restarts anymore.
Perfect! Thanks for the update, @straurob 👍
We are on the version below and we have this issue. Any help? Maybe we need to reopen this thread.
Init Binary: docker-init
[root@ip-172-20-0-13 compose-stacks]# docker --version
[root@ip-172-20-0-13 compose-stacks]# docker info
Server:
@mailrahulsre your daemon is running
How do you capture the culprit sending the SIGKILL with `auditd`?
I've posted the audit rules in this comment: #38768 (comment)
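For readers who can't reach the linked comment, a minimal sketch of audit rules that can reveal which process sends a SIGKILL. These are illustrative and may differ from the rules actually posted there; the key name `sigkill_trace` is made up.

```bash
# Log every kill() syscall whose signal argument (a1) is 9 (SIGKILL),
# then search the audit log for the matching events.
auditctl -a always,exit -F arch=b64 -S kill -F a1=9 -k sigkill_trace
ausearch -k sigkill_trace -i
```

In the matching SYSCALL records, the `exe=` field shows the binary that issued the signal.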
Description
Since the upgrade from 18.09.0 to 18.09.1 (same with 18.09.2) we experience random task failures.
The tasks die with `non-zero exit (137)` error messages, which indicate a SIGKILL being received. A common reason is an OOM kill, but this is not the case for our containers.
We monitor the memory usage, and the affected containers are well below all limits (per container, and the host also has enough free memory).
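As a quick spot check of memory usage against the configured limits (a minimal sketch; the format string is just one possible layout):

```bash
# One-shot snapshot of per-container memory usage versus its limit.
docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'
```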
Also, there is no usual kernel stack trace in the logs, and a `docker inspect` on the dead containers shows `"OOMKilled": false`. We tried forcefully provoking an OOM kill, and it shows the expected stack trace and, in this case, also the `OOMKilled` flag set to true.
Also, the containers are not supposed to shut down, and the health checks are not the culprit.
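For reference, a minimal sketch of how the exit code and OOM flag can be read from a dead container (the container name `web_1` is hypothetical):

```bash
# Exit code 137 together with OOMKilled=false points at an external SIGKILL
# rather than the kernel OOM killer.
docker inspect --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' web_1
```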
We experience this with practically all our containers, which are very different.
For example, it also happens with the official `nginx` image only serving static files, so we don't expect our containers to be the culprit, also because with the very same images we don't experience this issue with 18.09.0.
So the question is: who is killing our containers?
We managed to capture the culprit sending the SIGKILL with `auditd`. Here is the relevant `ausearch` output:
So it seems `runc` is for some reason killing the container. But why?

Steps to reproduce the issue:
Describe the results you received:
Tasks randomly failing:
Describe the results you expected:
Tasks NOT randomly failing.
We have downgraded our production environment to 18.09.0 again and have not experienced the failure in the last weeks; everything else stayed the same (images, configuration, kernel, etc.).
So it is definitely a problem introduced in 18.09.1.
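A minimal sketch of such a downgrade on an apt-based host (the version string must come from the repository listing; on rpm-based hosts the equivalent would be a `yum downgrade`):

```bash
# List the engine versions available from the configured apt repository.
apt-cache madison docker-ce

# Pick the 18.09.0 build from that list, then install it explicitly.
VERSION_STRING="$(apt-cache madison docker-ce | awk '/18\.09\.0/ {print $3; exit}')"
sudo apt-get install -y --allow-downgrades "docker-ce=${VERSION_STRING}" "docker-ce-cli=${VERSION_STRING}"
```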
Additional information you deem important (e.g. issue happens only occasionally):
We have a Docker swarm running with around 25 nodes, 35 services and 100 containers.
The issue happens around 2-5 times a day with completely different containers on all of the nodes.
Since the upgrade, we have not had a single day with fewer than 2 kills.
We had it happen up to 6 times in a single day.
Every container seems to be equally likely to be affected.
Output of `docker version`:

Output of `docker info`:

Additional environment details (AWS, VirtualBox, physical, etc.):
On AWS/EC2.