kubelet counts active page cache against memory.available (maybe it shouldn't?) #43916
I'm trying to better understand how the kernel deals with the page cache in terms of active and inactive pages, and I may have just discovered that it actually does not reclaim active page cache (for example, if you echo 3 to drop_caches), or at least doesn't know what to do with it if there is no swap available (and as recommended by the Kubernetes documentation, my nodes have swap disabled). So maybe I'm just totally wrong here, and I need to better understand specifically how the system considers page cache entries active, and work back from there. ...Well, with another drop_caches test on a machine reporting 1.7GB Active(file), a substantial amount did get dumped, dropping Active(file) to ~134MB, so maybe I'm still onto something here.
At this point in my research I'm wondering if drop_caches releases active page cache because it actually first moves pages to the inactive_list, then evicts from the inactive_list. And if something like that is happening, then maybe it's not possible to determine what from the active_list could be dropped without iterating over it, which is not something cAdvisor or kubelet would do. I guess I was hoping there'd be some stats exposed somewhere that could be used as a heuristic to determine, with some reasonable approximation, what could be dropped without having to do anything else, but maybe that just doesn't exist. But if that's the case, then I wonder how it's possible to use memory eviction policies effectively. At this point I'm sufficiently dizzy from reading various source code and documents, and I'm just going to shut up now.
@vdavidoff the memory management is convoluted for sure. To start, I would agree that subtracting the active pages from available isn't a great heuristic. It is very pessimistic about reclaimable memory and how much of the active list could be reclaimed without pushing the system into a thrashing state. Doing a drop_caches empties both the inactive and active page cache lists. In order to get a good value for memory.available, you would basically have to drop caches first and then look at memory usage.
This is not workable, however, because it trashes the page cache system-wide and would cause periodic performance degradation. So the question is "how can we determine the minimum amount of memory needed to support the currently running workload without thrashing?" Not an easy question to answer.
And to answer your question about how exactly drop_caches works: it actually works backward from the filesystem to find the pages that can be freed. It doesn't consider the LRU, i.e. the active/inactive page lists.
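For concreteness, the cache drop discussed in the last few comments looks like this (a minimal sketch; run as root, and note it drops clean page cache node-wide, not per cgroup):

```bash
# Flush dirty pages first so clean page cache can actually be dropped,
# then ask the kernel to drop page cache plus dentries and inodes.
sync
echo 3 > /proc/sys/vm/drop_caches

# Observe the effect on the file LRU lists before and after:
grep -E 'Active\(file\)|Inactive\(file\)' /proc/meminfo
```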
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten

It seems like we're affected by this problem as well. With tightly packed containers, long-running jobs involving heavy disk I/O sporadically fail. Take this example:
Test this with:
The latter command might take a moment to become available. The job tries to unpack 1TiB of zeroes (triple-compressed with zstd) and fails; there seems to be a problem like the one described in this issue.
The zstd used is a vanilla 1.2.0 packaged for xenial (previous versions are …).
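The original Job spec and test commands were lost from the comment above; purely as an illustrative sketch of that kind of load (smaller than 1TiB so it finishes quickly, file names invented), something along these lines exercises the same pattern of heavy streaming I/O under a tight memory limit:

```bash
# Build a highly compressible test archive (zeroes, triple-compressed with zstd).
head -c 10G /dev/zero | zstd -q | zstd -q | zstd -q > zeroes.zst.zst.zst

# Unpack it again; run inside a container with a low memory limit while
# watching the cgroup's memory.stat to see the page cache / OOM behaviour.
zstd -dcq zeroes.zst.zst.zst | zstd -dcq | zstd -dcq | wc -c
```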
We don't currently have a reliable reproducer for this, but we often hit it when restoring large PostgreSQL backups. Is there a good way to fix this without a change to the kernel's implementation of cgroups, though? Should this perhaps rather be a kernel bug?
This test load of mine seems to be killed reliably:
other region:
These events occur around the same time. I guess some other task is contending for resources, maybe causing buffer growth that pushes it over the limit due to I/O slowdown of the PV? Hard to guess.
I ran into the same issue today. On a node with 32GB of memory, 16+GB was cached. When the memory used plus cache exceeded 29GB (~90% of 32GB), the kubelet tried to evict all the pods, which shouldn't have happened since the node still had close to 50% of its memory available, albeit in cache. Is there a fix for this issue?
As part of other investigations we've been recommended to use https://github.com/Feh/nocache to wrap the corresponding calls, which helped a fairly big amount :)
Also having this problem. We had 53GB of available memory and 0.5GB free; 52.5GB was in buff/cache, and the node started trying to kill pods due to SystemOOM.
If it is indeed expected behaviour, it certainly seems to surprise some people in nasty ways...
This is not expected behavior. OS page caching has been around for a long time, and any app looking at memory usage should consider the cached memory.
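As an illustration of that point (a sketch, with invented numbers), `free` already separates cache from truly-used memory and estimates what is reclaimable:

```bash
# The "available" column already discounts reclaimable cache; kubelet's
# memory.available heuristic does not, because it only subtracts inactive
# file pages and still counts active page cache as used.
free -m
#               total        used        free      shared  buff/cache   available
# Mem:          64000        8000         500         200       55500       54800
```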
I've been digging in here trying to figure out why the workload @berlincount provided was OOMing. Given its simplicity, it just seemed odd. Using the exact Job spec provided by Andreas, I spun up a k8s cluster, ran the job, and watched the kernel stats for the pod. The kernel was performing sanely for the most part. When it started to come under memory pressure, it started evicting from the page cache. Eventually, the page cache values approached zero. The following is the output of the cgroup's memory.stat about thirty seconds before the OOM:
...and this is the cgroup stats from the OOM log. As you can see, kmem goes through the roof. I have no idea why, but the culprit doesn't appear to be the page cache or any of the processes in the container. From what I can tell, kmem counts against the cgroup's memory limit, so it looks like the kernel is hogging the memory. Looking at the cgroup's slabinfo, I can see a fairly large number of radix_tree_node slabs, though I'm still not seeing where a whole GB of memory went. Of note, the memory in question appears to be freed after the cgroup is destroyed. I'm not 100% sure what's going on here but, to me, it looks like there may be a memory leak in the kernel OR I'm missing something (I personally find the second option slightly more plausible than the first). Any ideas?
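For anyone repeating this kind of investigation, these are the cgroup v1 files that correspond to the numbers discussed above (the pod cgroup path is hypothetical and depends on the cgroup driver and QoS class; kmem files require kernel memory accounting to be enabled):

```bash
POD_CG=/sys/fs/cgroup/memory/kubepods/burstable/pod<uid>    # hypothetical path
cat $POD_CG/memory.stat                 # cache, rss, (total_)active_file, (total_)inactive_file, ...
cat $POD_CG/memory.usage_in_bytes       # charged memory, page cache included
cat $POD_CG/memory.kmem.usage_in_bytes  # kernel memory charged to the cgroup
cat $POD_CG/memory.kmem.slabinfo        # per-slab breakdown (e.g. radix_tree_node)
```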
Interesting... the nodes I executed the above tests on were GKE COS nodes.
Swapping the node out for an Ubuntu node seems to correct the memory usage: the pod never uses more than 600 MiB of RAM. This is looking more and more like some kind of memory leak or misaccounting that's present in the 4.4 series kernel used by COS nodes but not in the 4.13 series used by Ubuntu nodes.
We seem to be having more success running our containers on Ubuntu nodes... so, I'd concur :)
I'd like to share some observations, though I can't say I have a good solution to offer yet, other than to set a memory limit equal to the memory request for any pod that makes use of the file cache. Perhaps it's just a matter of documenting the consequences of not having a limit set. Or perhaps an explicit declaration of cache reservation should exist in the podspec, in lieu of assuming "inactive -> not important to reserve". Another possibility I've not explored is cgroup soft limits, and/or a more heuristic-based detection of memory pressure.

Contrary Interpretations of "Inactive"

Kubernetes seems to have an implicit belief that the kernel is finding the working set and keeping it in the active LRU. Everything not in the working set goes on the inactive LRU and is reclaimable. A quote from the documentation [emphasis added]:
Compare a comment from mm/workingset.c in the Linux kernel:
While both Kubernetes and Linux agree that the working set is in the active list, they disagree about where memory in excess of the working set goes. I'll show that Linux actually wants to minimize the size of the inactive list, putting all extra memory in the active list, as long as there's a process using the file cache enough for it to matter (which may not be the case if the workload on a node consists entirely of stateless web servers, for example).

The Dilemma

One running an IO workload on Kubernetes must:
The Kernel Implementation

Note: I'm not a kernel expert. These observations are based on my cursory study of the code. When a page is first loaded, it goes onto the inactive list. Subsequent accesses to a page call mark_page_accessed, which promotes it toward the active list. Accessing a page twice is all that's required to get onto the active list. If the inactive list is too big, there may not be enough room in the active list to contain the working set. If the inactive list is too small, pages may be pushed off the tail before they've had a chance to move to the active list, even if they are part of the working set. mm/workingset.c deals with this balance. It forms estimates from the inactive file LRU list stats and maintains "shadow entries" for pages recently evicted from the inactive list. When a page refaults, the shadow entry lets the kernel tell whether it was evicted too soon.

So accessing a page twice gets it added to the active list. What puts downward pressure on the active list? During scans (normally by kswapd, but directly by an allocation if there are insufficient free pages), pages are moved from the tail of the active list back to the inactive list when the active list is deemed too large. The kernel's comments on this balancing explain:
So, the presence of refaults (meaning pages are faulted, pushed off the inactive list, then faulted again) indicates the inactive list is too small, which means the active list is too big. If refaults aren't happening, then the ratio of active:inactive is capped by a formula based on the total size of inactive + active; a larger cache favors a larger active list in proportion to the inactive list. I believe (though I've not confirmed by experiment) that the presence of a large number of refaults could also mean there simply isn't enough memory available to contain the working set. The refaults will cause the inactive list to grow and the active list to shrink, causing Kubernetes to think there is less memory pressure: the opposite of reality!
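The refault behaviour described above is observable from userspace. A sketch (counter names vary slightly across kernel versions; on newer kernels they are split into `*_file` / `*_anon` variants):

```bash
# Rising workingset_refault means pages evicted from the inactive list are
# being faulted back in; workingset_activate counts refaults promoted
# straight to the active list.
grep -E '^workingset_(refault|activate|nodereclaim)' /proc/vmstat

# The list sizes themselves, per memory cgroup (cgroup v1 root shown):
grep -E '^(total_)?(active|inactive)_file ' /sys/fs/cgroup/memory/memory.stat
```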
+1 👍 |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
For OOMs related to dirty pages, try setting …
Note that we are using a kernel that has backported …
We have tried @wenjianhn's workaround of setting … We also noticed that Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha) follows the same approach by improving the handling of memory.high.
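For readers unfamiliar with it, memory.high is cgroup v2's soft limit: crossing it triggers reclaim and throttling without an OOM kill, while memory.max remains the hard limit. A sketch of setting it by hand for a pod cgroup (hypothetical path and values; this is what the memory QoS feature automates):

```bash
# cgroup v2, by hand: start reclaim (including dirty-page writeback pressure)
# at 1.5 GiB, OOM-kill only at the hard 2 GiB limit.
POD_CG=/sys/fs/cgroup/kubepods.slice/<pod-slice>     # hypothetical pod cgroup path
echo $((1536 * 1024 * 1024)) > $POD_CG/memory.high
echo $((2048 * 1024 * 1024)) > $POD_CG/memory.max
```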
Trying to summarize information on this for my own understanding... any clarifications appreciated! As far as I can tell, although it seems cgroup v2 alone may not fully address this, the memory QoS feature of KEP 2570, introduced in k8s 1.22, which uses new features of cgroup v2, is meant to address it. A comment in the PR for that KEP says:
However, based on a careful reading of the changes in k8s 1.27, a Guaranteed pod (req=lim) no longer results in memory.high being set. The memory QoS feature gate remains alpha (off by default) in k8s 1.30.
My understanding is as follows:
PSI is an observability feature that merely measures the delays caused by reclaim. In my opinion it has no relevance here. The kernel's reclaim algorithm will trigger when cgroup v1's memory.limit or cgroup v2's memory.high or memory.max is reached. The only difference here between memory.high and memory.max is that memory.high will trigger earlier and not cause an OOM if the reclaim fails, but in both cases it will reclaim all page cache before failing. TLDR: Slightly overcommit node memory available for scheduling. Use sensible memory.request and memory.limit values for all containers. If your application needs page cache to function, make room for it in your memory.request. If your nodes have swap enabled then cgroup v2's memory.high (i.e. Kubernetes memory QoS feature) together with PSI metrics can gracefully degrade your workload giving you more time to increase memory.request before an OOM kill but does nothing new to solve the node pressure eviction issue that we're talking about here. |
Are you sure the page cache (active files) won't cause OOM kills? When the container_memory_working_set_bytes metric reaches the container memory limit, the container is expected to be OOM-killed, and this metric takes active page cache into account. That's why some people here are desperately trying to drop caches for IO-bound pods. Setting request=limit may help with node evictions but won't help with OOM kills.
Clearly, inactive pages will always be reclaimed before an OOM kill. The kernel tries to keep a balance between the active and inactive lists and adjusts them periodically, moving pages back from active to inactive. It should be noted here that the kernel is rather conservative and keeps pages on the active list that haven't been accessed for some time if there is no pressure to balance the lists, so the active list can be larger than the actual working set. It is my understanding that a page on the active list has to be unreferenced for two consecutive "cycles" to move back to the inactive list via asynchronous reclaim. An OOM kill happens when a page fault cannot be satisfied from the kernel's list of free pages and the synchronous direct reclaim fails to evict enough pages. The question is then: does direct reclaim evict pages directly from the active list, or at least rebalance it so that they can be evicted from the inactive list in the next scan? The entry point for direct reclaim is try_to_free_pages (try_to_free_mem_cgroup_pages for cgroup limits), which ends up scanning both the inactive and active lists. Here's an older blog post from 2015 that also shows how the reclaim path handles the active list. And here is a StackOverflow post where a user creates a kernel patch specifically to avoid reclaim of active page cache. While I'm far from an expert on the Linux reclaim algorithm, given this evidence I do believe that active file pages will generally be reclaimed before an OOM kill is triggered, absent new information. I can imagine a few reasons why an OOM may occur while there is still active file usage reported: wrong NUMA zone, pages taking too long to write back (file system locks preventing writeout), memory fragmentation, slow metrics collection, etc.
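One way to check experimentally whether direct reclaim is running on a node, rather than only kswapd (a sketch; counter names vary a little between kernel versions):

```bash
# kswapd vs. direct reclaim activity; growing pgscan_direct/pgsteal_direct or
# allocstall counts mean allocations are having to reclaim synchronously.
grep -E '^(pgscan|pgsteal)_(kswapd|direct)|^allocstall' /proc/vmstat
```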
In any case, I don't see what goal this is supposed to achieve. If this page cache is on the inactive LRU, then it will surely be reclaimed before an OOM; I think everyone can agree on that. If it is on the active LRU and part of the true working set, then dropping it manually will do nothing, as it will be faulted in again almost immediately. If this page cache is "active" but not truly used, then the kernel will eventually move it back to inactive when balancing the LRUs, where again it can be reclaimed.
Thanks for the detailed info. |
Can you link to that documentation page? I'm not aware of any Kubernetes-controlled OOM tool (besides pod evictions). As far as I know, actual OOM kills are handled transparently and exclusively by the operating system (through cgroups in Linux). My guess is that they either mean a pod eviction (which directly tracks cAdvisor's working-set metric) or a kill by the kernel's OOM killer. Anyone who is interested in Kubernetes OOMs should definitely read this excellent blog post; there is a flow chart at the end showing all the different OOM scenarios (but don't forget that the blog post refers to the outdated cgroup v1 in many places). When pod evictions are the problem, then …
This is the official documentation of an OOM-killed pod: https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#exceed-a-container-s-memory-limit. It says a pod becomes a candidate for termination and is killed if it reaches the memory limit. However, it doesn't explain how a container is evaluated to have reached its limit (which metric?) or who actually kills it (kernel or kubelet?). When I google "container_memory_working_set_bytes and oom killer" I see non-official articles saying that container_memory_working_set_bytes is monitored and, if exceeded, the container is killed by the kubelet. Maybe this is one of those internet myths or pieces of misinformation being spread. I wish the official documents were clearer.
+1 to #43916 (comment). Setting …

cc @dchen1107
To my knowledge, the contents of these posts propagate myths. I basically agree with what @tschuchortdev has been saying in this thread. In my experience, … The really difficult thing here is that memory accounting is very complicated; more complicated than one number can surface to users.
@tschuchortdev |
Both can be true. It is my belief (according to the investigation above) that …
What sort of help are you looking for? In the blog post series I linked earlier there should be a program that reserves page cache, if I remember correctly. You could use that to test the behaviour of the OOM killer and kubelet pod evictions experimentally.
Thanks, I will read the blog carefully and do some tests. Your answer helps!
By the way, is this your guess or the official verdict; is it a proven conclusion? Is it possible for us to have the kubelet reclaim …?
I think the fact that …
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
active_file
inactive_file
working_set
WorkingSet
cAdvisor
memory.available
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
We'll say BUG REPORT (though this is arguable)
Kubernetes version (use kubectl version): 1.5.3
Environment:
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.5 LTS"
VERSION_ID="14.04"
Kernel (e.g. uname -a): Linux HOSTNAME_REDACTED 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Others:
What happened:
A pod was evicted due to memory pressure on the node, when it appeared to me that there shouldn't have been sufficient memory pressure to cause an eviction. Further digging seems to have revealed that active page cache is being counted against memory.available.
What you expected to happen:
memory.available would not have active page cache counted against it, since it is reclaimable by the kernel. This also seems to greatly complicate a general case for configuring memory eviction policies, since in a general sense it's effectively impossible to understand how much page cache will be active at any given time on any given node, or how long it will stay active (in relation to eviction grace periods).
How to reproduce it (as minimally and precisely as possible):
Cause a node to chew up enough active page cache that the existing calculation for memory.available trips a memory eviction threshold, even though the threshold would not be tripped if the page cache - active and inactive - were freed for anon memory.
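A quick way to manufacture active page cache for such a test (a sketch; file size and path are arbitrary) is to write a large file and read it back twice, since a second access is what promotes pages from the inactive to the active file list:

```bash
# Create a file roughly the size of the eviction headroom, then read it twice;
# the second pass moves the pages from Inactive(file) to Active(file).
dd if=/dev/urandom of=/var/tmp/cachefill bs=1M count=4096
cat /var/tmp/cachefill > /dev/null
cat /var/tmp/cachefill > /dev/null
grep -E 'Active\(file\)|Inactive\(file\)' /proc/meminfo
```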
Anything else we need to know:
I discussed this with @derekwaynecarr in #sig-node and am opening this issue at his request (conversation starts here).
Before poking around on Slack or opening this issue, I did my best to read through the 1.5.3 release code, Kubernetes documentation, and cgroup kernel documentation to make sure I understood what was going on here. The short of it is that I believe this calculation:
memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
Is using cAdvisor's value for working set, which if I traced the code correctly, amounts to:
$cgroupfs/memory.usage_in_bytes - total_inactive_file
Where, according to my interpretation of the kernel documentation, usage_in_bytes includes all page cache:
$kernel/Documentation/cgroups/memory.txt
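To make the calculation above concrete, here is a rough shell approximation of the number kubelet compares against its eviction threshold (cgroup v1 paths; capacity is taken from MemTotal here, whereas kubelet uses cAdvisor's machine capacity, so treat this as a sketch, not the actual kubelet code):

```bash
# workingSet       = root cgroup usage_in_bytes - total_inactive_file
# memory.available = node capacity - workingSet
capacity=$(awk '/^MemTotal:/ {print $2 * 1024}' /proc/meminfo)
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
inactive_file=$(awk '/^total_inactive_file / {print $2}' /sys/fs/cgroup/memory/memory.stat)
echo "memory.available ~= $(( capacity - (usage - inactive_file) )) bytes"
```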
Ultimately my issue concerns how I can set generally applicable memory eviction thresholds if active page cache is counted against them, and there's no way to know (1) generally how much page cache will be active across a cluster's nodes, to use as part of general threshold calculations, or (2) how long active page cache will stay active, to use as part of eviction grace period calculations.
I understand that there are many layers here and that this is not a particularly simple problem to solve generally correctly, or even understand top to bottom. So I apologize up front if any of my conclusions are incorrect or I'm missing anything major, and I appreciate any feedback you all can provide.
As requested by @derekwaynecarr: cc @sjenning @derekwaynecarr