kubelet counts active page cache against memory.available (maybe it shouldn't?) #43916
I'm trying to better understand how the kernel deals with the page cache in terms of active and inactive pages, and I may have just discovered that it actually does not reclaim active page cache (for example, if you echo 3 to drop_caches), or at least doesn't know what to do with it if there is no swap available (and as recommended by the Kubernetes documentation, my nodes have swap disabled). So maybe I'm just totally wrong here, and I need to better understand specifically how the system considers page cache entries active, and work back from there. ...well, with another drop_caches test on a machine reporting 1.7GB Active(file), a substantial amount did get dumped, dropping Active(file) to ~134MB, so maybe I'm still onto something here.
At this point in my research I'm wondering if drop_caches releases active page cache because it actually first moves pages to the inactive_list, then evicts from the inactive_list. And if something like that is happening, then maybe it's not possible to determine what from the active_list could be dropped without iterating over it, which is not something cAdvisor or kubelet would do. I guess I was hoping there'd be some stats exposed somewhere that could be used as a heuristic to determine with some reasonable approximation what could be dropped without having to do anything else, but maybe that just doesn't exist. But if that's the case, then I wonder how it's possible to use memory eviction policies effectively. At this point I'm sufficiently dizzy from reading various source code and documents, and I'm just going to shut up now.
@vdavidoff the memory management is convoluted for sure. To start, I would agree that subtracting the active pages from available isn't a great heuristic. It is very pessimistic about reclaimable memory and how much of the active list could be reclaimed without pushing the system into a thrashing state. In order to get a good value for memory.available, you would basically have to drop the caches and see how much memory is actually freed.
This is not workable however, because it trashes the page cache system-wide and would cause periodic performance degradation. So the question is "how can we determine the minimum amount of memory needed to support the currently running workload without thrashing?" Not an easy question to answer.
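A rough way to see how much of the "active" cache is actually reclaimable on a test node is to snapshot the counters, drop the caches, and compare. These are illustrative commands (run as root), not something to do on a production node:

```bash
grep -E 'Active\(file\)|Inactive\(file\)|MemAvailable' /proc/meminfo

sync                               # flush dirty pages so clean cache can actually be dropped
echo 3 > /proc/sys/vm/drop_caches  # 1 = page cache, 2 = slab objects, 3 = both

grep -E 'Active\(file\)|Inactive\(file\)|MemAvailable' /proc/meminfo
```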
And to answer your question about how exactly drop_caches finds what to free:
It actually works backward from the filesystem to find the pages that can be freed. It doesn't consider the LRU, i.e. the active/inactive page lists.
/remove-lifecycle rotten

It seems like we're affected by this problem as well. With tightly packed containers, long-running jobs involving heavy disk I/O sporadically fail. Take this example:
Test this with:
The latter command might take a moment to become available. The job tries to unpack 1TiB of zeroes (triple-compressed with zstd), and there seems to be a problem like the one described in this issue.
The zstd used is a vanilla 1.2.0 packaged for xenial - previous versions are
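The original Job spec isn't reproduced above; a rough stand-in for this kind of workload (illustrative image, sizes, and memory limit, not the original manifest) would be something like:

```bash
# Illustrative reproduction sketch, not the original Job: stream a large,
# highly compressible file through a tightly memory-limited container.
dd if=/dev/zero bs=1M count=10240 | zstd -3 -o /tmp/zeroes.zst   # ~10 GiB of zeroes -> a few MiB compressed

docker run --rm --memory=256m -v /tmp:/data alpine:3.18 \
  sh -c 'apk add --no-cache zstd && zstd -dcf /data/zeroes.zst > /dev/null'
```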
We don't currently have a reliable reproducer for this, but we often hit it when restoring large PostgreSQL backups. Is there a good way to fix this without a change to the kernel's implementation of cgroups, though? Should this perhaps rather be a kernel bug?
This test load of mine seems to be killed reliably:
other region:
These events occur around the same time. I guess some other task is contending for resources, maybe causing buffer growth that pushes it over the limit due to I/O slowdown of the PV? Hard to guess.
I ran into the same issue today. On a node with 32GB memory, 16+GB is cached. When the memory used + cache exceeded 29GB (~90% of 32GB), the kubelet tried to evict all the pods, which shouldn't have happened since the node still had close to 50% of memory available, although in cache. Is there a fix to this issue?
As part of other investigations we've been recommended to use https://github.com/Feh/nocache to wrap the corresponding calls, which helped a fairly big amount :)
Also having this problem. We had 53GB of available memory and 0.5GB free. 52.5GB was in buff/cache and the kubelet started trying to kill pods due to SystemOOM.
If it is indeed expected behaviour it certainly seems to surprise some people in nasty ways ...
This is not expected behavior. The OS caching memory has been around for a long time. Any app looking at memory usage should take cached memory into account.
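For reference, the kernel already publishes its own estimate of usable memory that accounts for reclaimable cache:

```bash
# MemAvailable is the kernel's estimate of memory usable by new workloads
# without swapping; it already discounts reclaimable page cache.
grep -E 'MemFree|MemAvailable|Cached' /proc/meminfo

# "free" surfaces the same value in its "available" column (procps-ng >= 3.3.10)
free -m
```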
I've been digging in here trying to figure out why the workload @berlincount provided was OOMing. Given the simplicity, it just seemed odd. Using the exact Job spec provided by Andreas, I spun up a k8s cluster, ran the job, and watched the kernel stats for the pod. The kernel was performing sanely for the most part. When it started to come under memory pressure, it started evicting from the page cache. Eventually, the page cache values approached zero. The following is the output of the cgroup's memory.stat about thirty seconds before the OOM.
and this
is the cgroup stats from the OOM log. As you can see, kmem goes through the roof. I have no idea why, but the culprit doesn't appear to be the page cache or any of the processes in the container. From what I can tell, kmem counts against the cgroup's memory limit, so it looks like the kernel is hogging the memory. Looking at the cgroup's slabinfo, I can see a fairly large number of radix_tree_node slabs, though I'm still not seeing where a whole GB of memory went. Of note, the memory in question appears to be freed after the cgroup is destroyed. I'm not 100% sure what's going on here but, to me, it looks like there may be a memory leak in the kernel OR I'm missing something (I personally find the second option slightly more plausible than the first). Any ideas?
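For anyone wanting to repeat this kind of inspection, the relevant cgroup v1 files look roughly like this (the pod/container path below is a placeholder, and memory.kmem.slabinfo may require kernel memory accounting to be enabled):

```bash
# Placeholder path -- substitute the pod UID and container ID of the workload.
CG=/sys/fs/cgroup/memory/kubepods/burstable/pod<POD_UID>/<CONTAINER_ID>

cat "$CG/memory.usage_in_bytes"        # charged memory, page cache included
cat "$CG/memory.kmem.usage_in_bytes"   # kernel memory charged to the cgroup
grep -E '^total_(cache|rss|inactive_file|active_file)' "$CG/memory.stat"
cat "$CG/memory.kmem.slabinfo"         # per-cgroup slab usage (radix_tree_node etc.)
```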
Interesting...the nodes I executed the above tests on were GKE COS nodes
Swapping the node out for an Ubuntu node seems to correct the memory usage. The pod never uses more than 600 MiB of RAM. This is looking more and more like some kind of memory leak or misaccounting that's present in the 4.4 series kernel used in COS nodes but not in the 4.13 series used by Ubuntu nodes.
We seem to be having more success running our containers on Ubuntu nodes ... so, I'd concur :)
I'd like to share some observations, though I can't say I have a good solution to offer yet, other than to set a memory limit equal to the memory request for any pod that makes use of the file cache. Perhaps it's just a matter of documenting the consequences of not having a limit set. Or perhaps an explicit declaration of cache reservation should exist in the podspec, in lieu of assuming "inactive -> not important to reserve". Another possibility I've not explored is cgroup soft limits, and/or a more heuristic-based detection of memory pressure.

Contrary Interpretations of "Inactive"

Kubernetes seems to have an implicit belief that the kernel is finding the working set and keeping it in the active LRU. Everything not in the working set goes in the inactive LRU and is reclaimable. A quote from the documentation [emphasis added]:
Compare a comment from mm/workingset.c in the Linux kernel:
While both Kubernetes and Linux agree that the working set is in the active list, they disagree about where memory in excess of the working set goes. I'll show that Linux actually wants to minimize the size of the inactive list, putting all extra memory in the active list, as long as there's a process using the file cache enough for it to matter (which may not be the case if the workload on a node consists entirely of stateless web servers, for example).

The Dilemma

Anyone running an IO workload on Kubernetes must:
The Kernel Implementation

Note: I'm not a kernel expert. These observations are based on my cursory study of the code.

When a page is first loaded, it is placed on the inactive list. Subsequent accesses to a page call mark_page_accessed(), which promotes it toward the active list. Accessing a page twice is all that's required to get on the active list.

If the inactive list is too big, there may not be enough room in the active list to contain the working set. If the inactive list is too small, pages may be pushed off the tail before they've had a chance to move to the active list, even if they are part of the working set. mm/workingset.c deals with this balance. It forms estimates from the inactive file LRU list stats and maintains "shadow entries" for pages recently evicted from the inactive list. When a recently evicted page is faulted back in (a "refault"), the shadow entry tells the kernel the inactive list was too small and the page is activated directly. So a page accessed twice gets added to the active list.

What puts downward pressure on the active list? During scans (normally by kswapd, but directly by an allocation if there are insufficient free pages), the kernel deactivates pages from the active list when the inactive list is considered too small relative to it. The comments on that inactive-list sizing check in mm/vmscan.c explain the target ratio:
So, the presence of refaults (meaning pages are faulted, pushed off the inactive list, then faulted again) indicates the inactive list is too small, which means the active list is too big. If refaults aren't happening, then the ratio of active:inactive is capped by a formula based on the total size of inactive + active. A larger cache favors a larger active list in proportion to the inactive list. I believe (though I've not confirmed with experiment) that the presence of a large number of refaults could also mean there simply isn't enough memory available to contain the working set. The refaults will cause the inactive list to grow and the active list to shrink, causing Kubernetes to think there is less memory pressure, the opposite of reality!
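Refault activity is observable, which is one way to check whether this scenario is happening on a node (counter names vary slightly by kernel version; the cgroup path below is a placeholder):

```bash
# System-wide refault counters: steadily increasing workingset_refault* while
# Inactive(file) grows and Active(file) shrinks matches the scenario above.
grep -E '^workingset_' /proc/vmstat

# On cgroup v2 kernels the same counters also exist per cgroup in memory.stat.
grep workingset /sys/fs/cgroup/<group>/memory.stat
```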
+1 👍
The kubelet will terminate end-user pods when the worker node has 'MemoryPressure', according to [1]. But confusingly, there exist two reasons for pods being evicted:
- one is that the whole machine's free memory is too low,
- the other is k8s's own calculation [2], i.e. memory.available [3] is too low.

To resolve this confusion for k8s users, collect and show the k8s global working-set memory to distinguish between these two causes.

Note:
1. Collecting only the k8s global memory stats is enough, because cgroupfs stats are propagated from child to parent, so the parent always notices the change and updates. And from k8s v1.6 [4], allocatable (/sys/fs/cgroup/memory/kubepods/) is more convincing than capacity (/sys/fs/cgroup/memory/).
2. There are two cgroup drivers or managers to control resources: cgroupfs and systemd [5]. We should take both into account. (Paths under the 'systemd' cgroup driver always end with '.slice'.)
3. The difference between cgroup v1 and cgroup v2: different field names in the memory.stat file, and the current usage is stored in different files (cgroup v1's memory.usage_in_bytes vs. cgroup v2's memory.current).

[1] https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior
[2] kubernetes/kubernetes#43916
[3] memory.available = memory.allocatable/capacity - memory.workingSet, where memory.workingSet = memory.currentUsage - memory.inactive_file
[4] kubernetes/kubernetes#42204, kubernetes/community#348
[5] https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/

Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
Reported-by: Teng Hu <huteng.ht@bytedance.com>
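For illustration, the cgroup v1 vs. v2 differences mentioned in note 3 come down to reading different files and field names. The top-level group is kubepods/ with the cgroupfs driver and kubepods.slice/ with the systemd driver; the pairing below is just the common case:

```bash
# cgroup v1: usage and hierarchical stats for the kubepods group
cat /sys/fs/cgroup/memory/kubepods/memory.usage_in_bytes
grep '^total_inactive_file' /sys/fs/cgroup/memory/kubepods/memory.stat

# cgroup v2: the same information, with different file and field names
cat /sys/fs/cgroup/kubepods.slice/memory.current
grep '^inactive_file' /sys/fs/cgroup/kubepods.slice/memory.stat
```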
maybe related to google/cadvisor#3286
Has anyone been able to confirm this?
We've noticed that prometheus with thanos sidecars increases active_file memory: the kernel counts the thanos sidecar reading prometheus-written files as active_file, which inflates working_set. All the containers in this scenario have a generous guaranteed memory limit. With large enough prometheus data retention / metric count and limited enough memory, working_set approaches the limit even though much of it is page cache. This makes it difficult to alert and know when prometheus instances with thanos sidecars are actually running out of memory and are close to OOM. I'm sure this applies to other heavy data workloads, such as a mysql database with a backup sidecar. In short, it would be great if active_file either weren't counted in working_set or were exposed separately so alerts could subtract it.
I've opened a cAdvisor PR to expose a non-evictable memory metric: google/cadvisor#3445. Right now it excludes both active_file and inactive_file.
Another +1 for this running as a DaemonSet - immediate relief on memory pressure, at least at a cursory glance. It'll do for now 🤷
This works by clearing the page cache, it seems. Some apps can have their performance crippled by clearing the page cache if they are read-intensive. It is a smart workaround, but be aware of what type of app you are running and how continuously clearing the cache might affect performance negatively, especially with network-based storage.
Hi @IgorBerman, would you mind sharing more details about this issue? We seem to have encountered a similar problem.
Using cgroups v2 does not seem to solve the problem. We have an AKS cluster that runs cgroups v2 by default, and we are still experiencing the same problem.
Same as hterik,
I have recently also faced this issue in a cluster when exporting data to cloud storage periodically. The memory manager (https://github.com/linchpiner/cgroup-memory-manager) did not work very well for me, as it destroyed the performance of my original functionality. Since I/O consumes page cache in Linux by default, another workaround is to use the O_DIRECT flag to read/write data, bypassing the page cache. It consumed quite some CPU & RAM and affected performance to some extent, but we eventually managed to export files with an acceptable compromise in performance by setting a relatively large export batch size via testing. Hope this helps a bit if you also have similar memory issues with frequent I/O operations in a cluster.
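As a concrete illustration of the O_DIRECT approach (the file paths and block size here are made up for the example), GNU dd can bypass the page cache for a bulk copy:

```bash
# Copy a large export without going through the page cache. The block size must
# meet the filesystem's alignment requirements for O_DIRECT; larger blocks
# amortize the extra per-request overhead.
dd if=/data/export.dump of=/backup/export.dump bs=4M iflag=direct oflag=direct status=progress
```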
The above workaround unfortunately only works with cgroups v1, I believe. This is a rough, quick-and-dirty bash equivalent for cgroups v2, only partly tested (I believe it works after running it for a couple of days, but I haven't rigorously tested it):
(part of a Daemonset definition that also dynamically provisions disks for swap)
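The script itself isn't reproduced above; a hypothetical sketch of that kind of loop (the threshold and interval are illustrative, not the author's values, and it watches MemFree as discussed below rather than MemAvailable) might look like:

```bash
#!/bin/sh
# Hypothetical sketch, not the referenced script: periodically drop clean page
# cache when free memory on the node falls below a threshold.
THRESHOLD_KB=524288   # 512 MiB, illustrative

while true; do
  free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
  if [ "$free_kb" -lt "$THRESHOLD_KB" ]; then
    sync
    echo 1 > /proc/sys/vm/drop_caches   # drop clean page cache only
  fi
  sleep 30
done
```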
Cgroup v2 doesn't have this issue, not sure what you are expecting the DS to do?
@alvaroaleman Interesting, are you able to confirm that cgroups v2 doesn't have this issue (given the conflicting reports above)? If so, I am hitting a rather similar-looking issue, specifically running with swap, that I am 90% confident is solved by the above script (based on a workload which was reliably triggering "The node was low on resource: memory. Threshold quantity: 100Mi" and now isn't; I can confirm that there's definitely enough memory+swap for the workload at every point). Also, I can confirm that the script did not work when I was looking at MemAvailable rather than MemFree.
The goal of this PR is to have additional cAdvisor metrics which expose total_active_file and total_inactive_file. Today working_set_bytes subtracts total_inactive_file in its calculation, but there are situations where exposing these metrics directly is valuable. For example, two containers sharing files in an emptyDir increases total_active_file over time. This is not tracked in the working_set memory. Exposing total_active_file and total_inactive_file to the user allows them to subtract out total_active_file or total_inactive_file if they so choose in their alerts. In the case of prometheus with a thanos sidecar, working_set can give a false sense of high memory usage. The kernel counts thanos reading prometheus written files as "active_file" memory. In that situation, a user may want to exclude active_file from their ContainerLowOnMemory alert. Relates to: kubernetes/kubernetes#43916
@alvaroaleman Also keen to know if cgroups v2 solves this problem. What are the steps you took to verify this?
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
active_file
inactive_file
working_set
WorkingSet
cAdvisor
memory.available
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
We'll say BUG REPORT (though this is arguable)
Kubernetes version (use kubectl version): 1.5.3
Environment:
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.5 LTS"
VERSION_ID="14.04"
Kernel (e.g. uname -a): Linux HOSTNAME_REDACTED 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Others:
What happened:
A pod was evicted due to memory pressure on the node, when it appeared to me that there shouldn't have been sufficient memory pressure to cause an eviction. Further digging seems to have revealed that active page cache is being counted against memory.available.
What you expected to happen:
memory.available would not have active page cache counted against it, since it is reclaimable by the kernel. This also seems to greatly complicate a general case for configuring memory eviction policies, since in a general sense it's effectively impossible to understand how much page cache will be active at any given time on any given node, or how long it will stay active (in relation to eviction grace periods).
How to reproduce it (as minimally and precisely as possible):
Cause a node to chew up enough active page cache that the existing calculation for memory.available trips a memory eviction threshold, even though the threshold would not be tripped if the page cache - active and inactive - were freed for anon memory.
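One illustrative way to generate that state on a test node is to read the same large file twice, so its pages get promoted to the active list (the file name and size are arbitrary):

```bash
dd if=/dev/urandom of=/var/tmp/cachefill bs=1M count=8192   # ~8 GiB scratch file
cat /var/tmp/cachefill > /dev/null    # first read: pages land on the inactive list
cat /var/tmp/cachefill > /dev/null    # second read: pages are promoted to the active list
grep -E 'Active\(file\)|Inactive\(file\)' /proc/meminfo
```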
Anything else we need to know:
I discussed this with @derekwaynecarr in #sig-node and am opening this issue at his request (conversation starts here).
Before poking around on Slack or opening this issue, I did my best to read through the 1.5.3 release code, Kubernetes documentation, and cgroup kernel documentation to make sure I understood what was going on here. The short of it is that I believe this calculation:
memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
Is using cAdvisor's value for working set, which if I traced the code correctly, amounts to:
$cgroupfs/memory.usage_in_bytes - total_inactive_file
Where, according to my interpretation of the kernel documentation, usage_in_bytes includes all page cache:
$kernel/Documentation/cgroups/memory.txt
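That calculation can be reproduced by hand on a node. The following is a sketch assuming cgroup v1 and reading from the root memory cgroup (the kubelet actually takes these values from cAdvisor):

```bash
usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
inactive_file=$(awk '/^total_inactive_file/ {print $2}' /sys/fs/cgroup/memory/memory.stat)
capacity=$(awk '/^MemTotal:/ {print $2 * 1024}' /proc/meminfo)

working_set=$((usage - inactive_file))
available=$((capacity - working_set))
echo "working_set=${working_set} memory.available=${available}"
# total_active_file stays inside usage_in_bytes and is not subtracted,
# which is exactly what this issue is about.
```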
Ultimately my issue concerns how I can set generally applicable memory eviction thresholds if active page cache is counted against them, and there's no way to know (1) generally how much page cache will be active across a cluster's nodes, to use as part of general threshold calculations, or (2) how long active page cache will stay active, to use as part of eviction grace period calculations.
I understand that there are many layers here and that this is not a particularly simple problem to solve generally correctly, or even understand top to bottom. So I apologize up front if any of my conclusions are incorrect or I'm missing anything major, and I appreciate any feedback you all can provide.
As requested by @derekwaynecarr: cc @sjenning @derekwaynecarr