Container with multiple processes not terminated when OOM #50632
Comments
@kellycampbell
@kellycampbell - Did you try the die-on-term option in uwsgi? Thanks.
Yes, we already use die-on-term. That works when the main process (uwsgi) receives SIGTERM, e.g. during a rolling update. This problem isn't a SIGTERM, though, and it goes to the child process (the uwsgi-python worker). Here's our uwsgi config if it helps:
The only thing on the sig list which looks applicable is resource-management. /sig wg-resource-management
@kellycampbell I don't see any support in uwsgi for the parent to get notified when a child gets killed. I can see folks struggling with this with just Docker - google for "oom-killer docker kill all processes".
uwsgi reaps the child process just fine and restarts another worker in its place. The problem from my point of view is that this doesn't match the k8s documentation about memory limits quoted in my first post, and it doesn't surface the fact that the infrastructure is killing a process, e.g. in the events list for the pod or somewhere else easy to notice. Ideally, there would also be a way to gracefully handle the interruption of the process being killed, as feature request #40157 asks for.
IIUC, the process which consumes the most memory will be OOM-killed in this case. The container won't terminate unless the killed process is the main process within the container. /sig node
@xiangpengzhao yes, that's why I was looking at options for @kellycampbell where the main process is the only one in the container. Guess we need a big disclaimer about child process(es) being OOM-killed.
I think maybe it's not clear how the resource limits are enforced. After troubleshooting this issue, I discovered my own misunderstanding of the split of responsibilities between k8s, the container runtime, and Linux cgroups.

I found this documentation helpful in understanding what is happening: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

This other page in the k8s docs could have better info under the "How Pods with resource limits are run" section: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/

Outside of documentation changes, the two other things that I think should be considered long-term for k8s are:

a) a way to surface an event when a particular container breaches its memory limit and has processes killed (if the pod doesn't terminate itself from the OOM-killed process), so admins know why things aren't working.

b) a way to more gracefully handle containers reaching their memory limits, e.g. with a signal to the pod (#40157).
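To make that split of responsibilities concrete, here is a minimal, hypothetical Pod spec (names and values are illustrative, not taken from this issue). The kubelet and container runtime translate limits.memory into the container's cgroup memory limit; the kernel OOM killer, not Kubernetes, is what enforces it by killing some process inside that cgroup.

```yaml
# Hypothetical example: the memory limit below becomes the container's cgroup
# limit, and the kernel OOM killer enforces it inside that cgroup.
apiVersion: v1
kind: Pod
metadata:
  name: uwsgi-example            # illustrative name
spec:
  containers:
  - name: app
    image: example/uwsgi-app     # placeholder image
    resources:
      requests:
        memory: "256Mi"          # used for scheduling decisions
      limits:
        memory: "512Mi"          # enforced via the container's memory cgroup
```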
/cc @kubernetes/sig-node-bugs
Containers are marked as OOM killed only when the init pid gets killed by the kernel OOM killer. There are apps that can tolerate OOM kills of non-init processes, so we chose not to track non-init process OOM kills.
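For reference, when it is the init (pid 1) process that gets killed, the container status reported by kubectl get pod -o yaml looks roughly like the illustrative fragment below; an OOM kill of a child process leaves this status untouched, which is why nothing shows up in the pod's restarts or events.

```yaml
# Illustrative status fragment (not from this issue): only an OOM kill of the
# container's init process produces an OOMKilled termination record.
status:
  containerStatuses:
  - name: app
    restartCount: 1
    lastState:
      terminated:
        reason: OOMKilled
        exitCode: 137
```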
I'd say this is Working as Intended.
I agree with @kellycampbell, this behavior is not very well documented...
I just ran into this issue too and I agree that this isn't well documented. I can see how one would assume that k8s enforces the memory limit and communicates this via the API/events/metrics. The real problem IMO, though, is the lack of visibility when this happens. You can get this from the kernel log, and more recent kernels expose it in vmstat (which is surfaced by the node-exporter as node_vmstat_oom_kill), but it can't be correlated to a pod.
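As a partial workaround for that visibility gap, a node-level alert can at least tell you that something on a node was OOM-killed. Below is a sketch of a Prometheus alerting rule, assuming a node-exporter recent enough to expose node_vmstat_oom_kill (as mentioned above); it still cannot attribute the kill to a particular pod.

```yaml
# Hypothetical Prometheus alerting rule: fires when the kernel's oom_kill
# counter (surfaced by node-exporter as node_vmstat_oom_kill) increases.
groups:
- name: oom-visibility
  rules:
  - alert: KernelOOMKillDetected
    expr: increase(node_vmstat_oom_kill[5m]) > 0
    labels:
      severity: warning
    annotations:
      summary: "Kernel OOM kill on {{ $labels.instance }}; check dmesg to find the victim process"
```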
Hello. This can lead to misbehaving or non-optimal Pods which still pass the health checks but should be destroyed anyway. I actually had a case where the same process was being killed over and over (~2000 times over 1 hour) but kept being re-spawned by its init process. Then the init process got OOMKilled and the container restarted. I suppose this issue is more a Docker issue than a Kubernetes one.
How can we evict a pod when one of its containers keeps restarting, in the case of more than one container per pod?
This project may be useful: https://github.com/ricardomaraschini/oomhero
Ok, this has bitten me in the a** big time. It cost me a day to find out that one of my Python child processes was OOM-killed. I would absolutely vote for an OOM killer which always kills the parent process no matter what. That makes the behaviour at least consistent. You assign a resource limit to the pod (as an entity), and clearly the pod went over that limit, so it should have been restarted.
It's vital not to unconditionally kill whole pods. Some apps rely on being able to handle child process OOMs themselves, and do not want the whole pod recursively killed. PostgreSQL, for example.
The same problem, but it caused some serious consequences.
And recently, in PROD, the OOM killer killed the bg_writer process within PG, so our PG got into an inconsistent state, along with some of its logical and streaming replicas. (At least that's what I observed and deduced from the logs.)
@alexandru-lazarev If a Pg process gets OOM-killed, the postmaster does an emergency shutdown and restart. It's inconvenient, but it should never cause any data issues unless you're doing very unsafe things in its configuration.
/kind bug
What happened:
A pod container reached its memory limit. Then the oom-killer killed only one process within the container. This container has a uwsgi python server which gave this error in its logs:
The only errors I could find in k8s were in the syslog on the node:
What you expected to happen:
I expected the whole container/pod to be terminated (and then restarted by the replica-set controller). I also expected to see "Restarts" count above 0 on the pods, and events in the pod or replica-set.
According to the documentation at https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#exceed-a-containers-memory-limit, the whole container should be terminated:
How to reproduce it (as minimally and precisely as possible):
Set up a multi-process server in a pod, e.g. uwsgi and Django, where uwsgi is the main process started in the container by k8s. Then have a child process use up more memory than the container limit.
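A minimal sketch of such a pod follows; the image, command, and limit are placeholders (not from the original report), standing in for any master/worker style server.

```yaml
# Hypothetical reproduction: a low memory limit plus a multi-process server.
# When a worker exceeds the limit, the kernel OOM killer kills that worker,
# the master re-spawns it, and Kubernetes shows no OOMKilled restart because
# the container's init process never died.
apiVersion: v1
kind: Pod
metadata:
  name: oom-child-repro                  # illustrative name
spec:
  containers:
  - name: uwsgi
    image: example/uwsgi-django-app      # placeholder image running uwsgi + Django
    command: ["uwsgi", "--ini", "/app/uwsgi.ini"]   # master process + several workers
    resources:
      limits:
        memory: "128Mi"                  # kept low so a single worker can exceed it
```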
Anything else we need to know?:
Another nice-to-have would be for a container that reaches its memory limit to immediately become not-ready and have its endpoints taken out of Services until it passes health checks again. Because of the hard SIGKILL, we're not able to gracefully handle this condition and client connections get dropped. I saw the workaround in #40157, so we will try that.
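In that spirit, something like the readiness probe sketched below could pull the endpoint out of the Service as memory usage approaches the limit, before the hard SIGKILL happens. This is only a rough sketch under assumptions not taken from this issue: a cgroup v1 layout readable from inside the container and an arbitrary 90% threshold.

```yaml
# Hypothetical readiness probe (goes in the container spec): fail readiness once
# memory usage crosses ~90% of the cgroup limit, so the endpoint is removed from
# the Service before the kernel OOM killer steps in. Assumes cgroup v1 paths.
readinessProbe:
  exec:
    command:
    - sh
    - -c
    - |
      usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
      limit=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
      threshold=$((limit / 10 * 9))
      # non-zero exit marks the container not-ready
      [ "$usage" -lt "$threshold" ]
  periodSeconds: 5
```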
Environment: