kubelet counts active page cache against memory.available (maybe it shouldn't?) #43916

vdavidoff · 2017-03-31T16:30:13Z

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
active_file
inactive_file
working_set
WorkingSet
cAdvisor
memory.available

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
We'll say BUG REPORT (though this is arguable)

Kubernetes version (use kubectl version):
1.5.3

Environment:

Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.5 LTS"
VERSION_ID="14.04"
Kernel (e.g. uname -a):
Linux HOSTNAME_REDACTED 3.13.0-44-generic kube-up: fix gcloud version check #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Others:

What happened:
A pod was evicted due to memory pressure on the node, when it appeared to me that there shouldn't have been sufficient memory pressure to cause an eviction. Further digging seems to have revealed that active page cache is being counted against memory.available.

What you expected to happen:
memory.available would not have active page cache counted against it, since it is reclaimable by the kernel. This also seems to greatly complicate a general case for configuring memory eviction policies, since in a general sense it's effectively impossible to understand how much page cache will be active at any given time on any given node, or how long it will stay active (in relation to eviction grace periods).

How to reproduce it (as minimally and precisely as possible):
Cause a node to chew up enough active page cache that the existing calculation for memory.available trips a memory eviction threshold, even though the threshold would not be tripped if the page cache - active and inactive - were freed for anon memory.

Anything else we need to know:
I discussed this with @derekwaynecarr in #sig-node and am opening this issue at his request (conversation starts here).

Before poking around on Slack or opening this issue, I did my best to read through the 1.5.3 release code, Kubernetes documentation, and cgroup kernel documentation to make sure I understood what was going on here. The short of it is that I believe this calculation:

memory.available := node.status.capacity[memory] - node.stats.memory.workingSet

Is using cAdvisor's value for working set, which if I traced the code correctly, amounts to:

$cgroupfs/memory.usage_in_bytes - total_inactive_file

Where, according to my interpretation of the kernel documentation, usage_in_bytes includes all page cache:

$kernel/Documentation/cgroups/memory.txt

 
The core of the design is a counter called the res_counter. The res_counter
tracks the current memory usage and limit of the group of processes associated
with the controller.
 
...
 
2.2.1 Accounting details
 
All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.

Ultimately my issue is concerning how I can set generally applicable memory eviction thresholds if active page cache is counting against those, and there's no way to to know (1) generally how much page cache will be active across a cluster's nodes, to use as part of general threshold calculations (2) how long active page cache will stay active, to use as part of eviction grace period calculations.

I understand that there are many layers here and that this is not a particularly simple problem to solve generally correctly, or even understand top to bottom. So I apologize up front if any of my conclusions are incorrect or I'm missing anything major, and I appreciate any feedback you all can provide.

As requested by @derekwaynecarr: cc @sjenning @derekwaynecarr

The text was updated successfully, but these errors were encountered:

vdavidoff · 2017-03-31T17:44:30Z

I'm trying to better understand how the kernel deals with the page cache in terms of active and inactive pages, and I may have just discovered that it actually does not reclaim active page cache (for example, if you echo 3 to drop_caches), or at least doesn't know what to do with it if there is no swap available (and as recommended by the Kubernetes documentation, my nodes have swap disabled). So maybe I'm just totally wrong here, and I need to better understand specifically how the system considers page cache entries active, and work back from there.

...well, with another drop_caches test on a machine with 1.7GB Active(file) reporting, a substantial amount did get dumped, dropping Active(file) to ~134MB, so maybe I'm still onto something here.

vdavidoff · 2017-04-01T01:37:38Z

At this point in my research I'm wondering if drop_caches releases active page cache because it actually first moves pages to the inactive_list, then evicts from the inactive_list. And if something like that is happening, then maybe it's not possible to determine what from the active_list cold be dropped without iterating over it, which is not something cadvisor or kubelet would do. I guess I was hoping there'd be some stats exposed somewhere that could be used as a heuristic to determine with some reasonable approximation what could be dropped without having to do anything else, but maybe that just doesn't exist. But if that's the case, then it I wonder how it's possible to use memory eviction policies effectively.

At this point I'm sufficiently dizzy from reading various source code and documents, and I'm just going to shut up now.

sjenning · 2017-04-03T17:19:07Z

@vdavidoff the memory management convoluted for sure.

To start, I would agree that subtracting the active pages from available isn't a great heuristic. It is very pessimistic about reclaimable memory and how much of the active list could be reclaimed without pushing the system into a thrashing state.

Doing a sync;echo 1 > drop_caches will free as much pagecache as possible. Keep in mind that this doesn't drop dirty or locked pages from the cache, hence the sync before the drop to maximize the cleanness of the cache and maximize reclaim.

In order to get a good value for available, which is "capacity - how much the system needs in order to run without thrashing, as indicated by high major fault rates", one would need to do something like

sync
echo 1 > drop_caches
sleep 10 # let processes fault their active working set back in
use calculation we currently use, as active_pages will now be closer to the value of "memory required to avoid thrashing"

This is not workable however, because it trashes the page cache system wide and would cause periodic performance degradation.

So the question is "how can we determine the minimum amount of memory needed to support the currently running workload without thrashing". Not a easy question to answer.

sjenning · 2017-04-03T17:39:55Z

And to answer your question about how exactly drop_caches works, basically like this

for each superblock
  for each cached inode
    invalidate all page mappings to the inode and release the pages

It actually works backward from the filesystem to find the pages that can be freed. It doesn't consider the LRU i.e. active/invactive pages lists

fejta-bot · 2017-12-22T21:48:40Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

fejta-bot · 2018-01-21T22:36:23Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

berlincount · 2018-02-16T16:28:11Z

/remove-lifecycle rotten

it seems like we're affected by this problem as well. with tightly packed containers, long running jobs involving heavy disk I/O sporadically fail.

take this example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: democlaim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd
  resources:
    requests:
      storage: 1.2Ti
---
apiVersion: batch/v1
kind: Job
metadata:
  name: demo
spec:
  template:
    spec:
      containers:
      - name: demo
        image: ubuntu
        command: ["bash",  "-c", "apt-get update ; apt-get install -y wget ; wget -O /data/zstd.deb https://packages.shopify.io/shopify/public/packages/ubuntu/xenial/zstd_1.2.0-0shopify_amd64.deb/download ; wget -O /data/libzstd.deb https://packages.shopify.io/shopify/public/packages/ubuntu/xenial/libzstd1_1.2.0-0shopify_amd64.deb/download ; dpkg -i /data/libzstd.deb /data/zstd.deb ; echo 'KLUv/QRQtR8ApnmvQMAaBwCp6S2VEGAQoIMR3DIbNd4HvrRTZ9cQVwgYX19vUMlci2xnmLLgkNZaGsZmRkAEFuSmnbH8UpgxwUmkdx6yAJoAhwDu8W4cEEiofKDDBIa1pguh/vv4eVH7f7qHvH1N93OmnQ312X+6h8rb+nS0n/eh6s+rP5MZwQUC7cOaJEJuelbbWzpqfZ6advxPlOv6Ha8/D2jCPwQceFDCqIIDoAAASmhMkDoVisCA6fmpJd0HKRY7+s/P0QkkGjVYP2dNCGq1WHe1XK2WqxUkwdVCGetBQRRYLBbrNFTEjlTgMLEiZmLIRYgWT9MzTQ+Uo2AUoWhAWFQB7iFvo6YSZNHNSY5U9n92D5W3/6d7P2+jZv8DWFs0oHjNZLU27B4qb/9P93gfavaf7iEWSYETxQOO2GqrJfH2Nd1b3X66d14DQbo8veCxY7W1GR1/uP2ne8jb16d7ADEH3qhAALGBYQPxeek4lUJjBMlpJuuhC/H8R9Ltp3vnRc2n/+6hEm2jyhSYMUT1hcBqq935072ft1GfHajbf7qH523U59PP7qESLQIiSTJcec1k7eF/uvfzNuqz/7N7qLz9P1FkmMw4ovLCtNo6wDtvoz77T/dQeVuf7v3sIIR5qM/+0z1U3r4+He28D/X9p3t43tb8zw91++fBcKaIhddM1k3uIW9fn+7sHur2/3QPk8UFlROB1dYxtlHT/Wf3ULf/Z1ZoNs6IyBb85CSerVSKFAx41MB6w6/0P91D3r4+3Xkf6vbvkqHlgQOYnNdMVhOS7p23Nf+ze6jbf7p3BotKKQwMvPuxqwxtQZGGqcNqax0+2vGxUwUQqBInORbAsCc4/utsIMHIjtdMVpvD+red51xycsNhyg2m1VZb0r/NqyyLUo8lW8t5/jf62eehbv9P9/C8jfp0/9kZiGP0JkILoBIw7KSCLKhikCMAAQHAgoA+Yk8AAjQQEoAACAxBQFcEQGCAcCuCACAAiAYIqDACVyuMeO/lZnP49YuJifET/DqMhFOzZUJDc6W5kGD1OGhhORIUxs/EoaGhcmhoLdYShiQhNGm//E0IDUWEMnroe0JoaDQcaLcMp63yfIuKck/X8QoCDbQRklMBggIErDt1qfySehKEwet2c/0/MLRMEH5ZAxq+RlpgiN8BMOMwt+HwGvF3W2aM0KjIUT/Em+cFyAEQMGUIEjCG7YLmcKmhA6ySpQ7QIJao+Tr/Ygp+MGmXtAyBBdHa63eY+W9lcdCVFioqTUB7WITH0ZAfgx5TXMzXgcmge1Iy3CK3WCk0xRLDTbllx2Ar9yhMpUkwoEDYJnasQZrXT/4JjLxAaWX9iX77a1KsfrFu5j8fRZmwDg==' | base64 -d > /data/1TiB_of_zeroes.tar.zst.zst.zst ; echo -n 'Unpacking ' ; zstd -d -T4 < /data/1TiB_of_zeroes.tar.zst.zst.zst | zstd -d -T4 | zstd -d -T4 | tar -C /data -xvf - && echo successful. || echo failed."]
        securityContext:
          privileged: false
        volumeMounts:
          - name: data
            mountPath: /data
            readOnly: false
        resources:
          requests:
            cpu: "4"
            memory: "1Gi"
          limits:
            cpu: "4"
            memory: "1Gi"
      restartPolicy: Never
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: democlaim
  backoffLimit: 0

Test this with:

kubectl create namespace tarsplosion
kubectl --namespace tarsplosion create -f ./demo.yml
kubectl --namespace tarsplosion logs job/demo --follow

the latter command might take a moment to become available.

The job tries to unpack 1TiB of zeroes (triple-compressed with zstd) - and
apparently fails because of memory exhaustion by buffers filled by tar.

There seems to be the problem like
https://serverfault.com/questions/704443/tar-uses-too-much-memory-for-its-buffer-workaround

the job only fails sometimes, but then in a nasty fashion.

The zstd used is a vanilla 1.2.0 packaged for xenial - previous versions are
not multithreaded and have a slightly different file format.

mgomezch · 2018-02-16T16:44:34Z

We don't currently have a reliable reproducer for this, but we often hit this when restoring large PostgreSQL backups with pg_basebackup. A particularly horrible but effective hack to help the backup restore process complete is to exec into the pod and sync; echo 1 > drop_caches repeatedly as suggested above (it also helps to sigstop/cont the backup process while flushing the cache).

Is there a good way to fix this without a change to the kernel's implementation of cgroups, though? Should this perhaps rather be a kernel bug?

berlincount · 2018-02-20T10:07:36Z

This test load of mine seems to be killed reliably:

[..]
Unpacking 1TiB_of_zeroes
tar: 1TiB_of_zeroes: Wrote only 2560 of 10240 bytes
$ kubectl --namespace tarsplosion get events
LAST SEEN   FIRST SEEN   COUNT     NAME                          KIND      SUBOBJECT               TYPE      REASON                 SOURCE                                            MESSAGE
12m         12m          1         demo-xmnr6.1514cb1745ab1ac3   Pod                               Normal    SandboxChanged         kubelet, gke-tier1-central-pool-3-3f529658-8wwx   Pod sandbox changed, it will be killed and re-created.
11m         11m          1         demo-xmnr6.1514cb1e485790c7   Pod       spec.containers{demo}   Normal    Killing                kubelet, gke-tier1-central-pool-3-3f529658-8wwx   Killing container with id docker://demo:Need to kill Pod
11m         11m          1         demo.1514cb1e8c67ab56         Job                               Warning   BackoffLimitExceeded   job-controller                                    Job has reach the specified backoff limit
$ kubectl --namespace tarsplosion get pods -a
NAME         READY     STATUS      RESTARTS   AGE
demo-xmnr6   0/1       OOMKilled   0          17h

other region:

[..]
Unpacking 1TiB_of_zeroes
tar: 1TiB_of_zeroes: Wrote only 512 of 10240 bytes
$ kubectl --namespace tarsplosion get events
LAST SEEN   FIRST SEEN   COUNT     NAME                          KIND      SUBOBJECT               TYPE      REASON                 SOURCE                                            MESSAGE
11m         11m          1         demo-xmnr6.1514cb1745ab1ac3   Pod                               Normal    SandboxChanged         kubelet, gke-tier1-central-pool-3-3f529658-8wwx   Pod sandbox changed, it will be killed and re-created.
11m         11m          1         demo-xmnr6.1514cb1e485790c7   Pod       spec.containers{demo}   Normal    Killing                kubelet, gke-tier1-central-pool-3-3f529658-8wwx   Killing container with id docker://demo:Need to kill Pod
11m         11m          1         demo.1514cb1e8c67ab56         Job                               Warning   BackoffLimitExceeded   job-controller                                    Job has reach the specified backoff limit
$ kubectl --namespace tarsplosion get pods -a
NAME         READY     STATUS      RESTARTS   AGE
demo-s7tzf   0/1       OOMKilled   0          17h

these events occur around the same time, I guess some other task is contending for resources, maybe causing buffer growth pushing it over the limit due to I/O slowdown of the PV? hard to guess.

devopsprosiva · 2018-03-01T23:58:39Z

I ran into the same issue today. On node with 32GB memory, 16+GB is cached. When the memory used + cache exceeded 29GB (~90% of 32GB), the kubelet tried to evict all the pods which shouldn't have happened since the node still had close to 50% of memory available, although in cache. Is there a fix to this issue?

berlincount · 2018-03-16T15:01:24Z

As part of other investigations we've been recommended to use https://github.com/Feh/nocache to wrap the corresponding calls, which helped a fairly big amount :)

treacher · 2018-03-21T01:42:22Z

Also having this problem. We had 53GB of available memory and 0.5GB free. 52.5GB is in buff/cache and it starts trying to kill pods due to SystemOOM.

berlincount · 2018-03-21T08:15:34Z

If it is indeed expected behaviour it certainly seems to surprise some people in nasty ways ...

devopsprosiva · 2018-03-21T20:56:44Z

This is not expected behavior. The OS caching the memory has been around for a long time. Any app looking at memory usage should consider the cached memory. Using nocache is not an ideal solution either. Is there anyway we can bump up the severity/need on this issue? We're planning to go into production soon but can't without this issue getting fixed

thefirstofthe300 · 2018-03-30T18:34:04Z

I've been digging in here trying to figure out why the workload @berlincount provided was OOMing. Given the simplicity, it just seemed odd.

Using the exact Job spec provided by Andreas, I spun up a k8s cluster, ran the job, and watched the kernel stats for the pod. The kernel was performing sanely for the most part. When it started to come under memory pressure, it started evicting from the page cache. Eventually, the page cache values approached zero.

The following is the output of the cgroup's memory.stat about thirty seconds before the OOM.

cache 139264
rss 3411968
rss_huge 0
mapped_file 61440
dirty 0
writeback 0
swap 0
pgpgin 115745935
pgpgout 115745068
pgfault 418117
pgmajfault 46033
inactive_anon 0
active_anon 3411968
inactive_file 24576
active_file 114688
unevictable 0
hierarchical_memory_limit 1073741824
hierarchical_memsw_limit 2147483648
total_cache 131072
total_rss 3411968
total_rss_huge 0
total_mapped_file 61440
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 115745935
total_pgpgout 115745070
total_pgfault 418117
total_pgmajfault 46033
total_inactive_anon 0
total_active_anon 3411968
total_inactive_file 16384
total_active_file 0
total_unevictable 0

and this

[ 6971.999289] memory: usage 1048576kB, limit 1048576kB, failcnt 38707874
[ 6972.005986] memory+swap: usage 1048576kB, limit 9007199254740988kB, failcnt 0
[ 6972.013451] kmem: usage 1045024kB, limit 9007199254740988kB, failcnt 0
[ 6972.020113] Memory cgroup stats for /kubepods/podf78402b6-33bd-11e8-ba59-42010a8a0160: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 6972.044815] Memory cgroup stats for /kubepods/podf78402b6-33bd-11e8-ba59-42010a8a0160/c38dca77af0d0c47320ee1aeffe4224894ae05d14cb0a54cc0d7fcb5f781fd0f: cache:0KB rss:40KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:40KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 6972.075069] Memory cgroup stats for /kubepods/podf78402b6-33bd-11e8-ba59-42010a8a0160/d7a947146cf94e188208fc1633a2f1275f9e75a2e8c7334133363a7743e81858: cache:116KB rss:3396KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:3396KB inactive_file:0KB active_file:4KB unevictable:0KB

is the cgroup stats from the OOM log. As you can see, kmem goes through the roof. I have no idea why, but the culprit doesn't appear to be the page cache or any of the processes in the container.

From what I can tell, kmem counts against the cgroup's memory limit so it looks like the kernel is hogging the memory.

Looking at the cgroups slabinfo, I can see a fairly large number of radix_node_tree slabs, though, I'm still not seeing where a whole GB of memory went.

Of note, the memory in question appears to be freed after the cgroup is destroyed.

I'm not 100% sure what's going on here but, to me, it looks like there may be a memory leak in the kernel OR I'm missing something (I personally find this second option to be slightly more plausible than the first).

Any ideas?

thefirstofthe300 · 2018-03-30T20:58:09Z

Interesting...the nodes I executed the above tests on were GKE COS nodes

$ cat /etc/os-release
BUILD_ID=10323.12.0
NAME="Container-Optimized OS"
KERNEL_COMMIT_ID=2d7de0bde20ae17f934c2a2e44cb24b6a1471dec
GOOGLE_CRASH_ID=Lakitu
VERSION_ID=65
BUG_REPORT_URL=https://crbug.com/new
PRETTY_NAME="Container-Optimized OS from Google"
VERSION=65
GOOGLE_METRICS_PRODUCT_ID=26
HOME_URL="https://cloud.google.com/compute/docs/containers/vm-image/"
ID=cos

$ uname -a
Linux gke-yolo-default-pool-a42e49fb-1b0m 4.4.111+ #1 SMP Thu Feb 1 22:06:37 PST 2018 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux

Swapping the node out for an Ubuntu node seems to correct the memory usage. The pod never uses more than 600 MiB of RAM according to kubectl top pod.

This is looking more and more like some kind of memory leak or misaccounting that's present in the 4.4 series kernel used in COS nodes but not in the 4.13 series used by Ubuntu nodes.

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

$ uname -a 
Linux gke-yolo-pool-1-cb926a0e-51cf 4.13.0-1008-gcp #11-Ubuntu SMP Thu Jan 25 11:08:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

berlincount · 2018-04-12T12:12:51Z

We seem to be having more success running our containers Ubuntu nodes ... so, I'd concur :)

bitglue · 2018-05-30T16:33:44Z

I'd like to share some observations, though I can't say I have a good solution to offer yet, other than to set a memory limit equal to the memory request for any pod that makes use of the file cache.

Perhaps it's just a matter of documenting the consequences of not having a limit set.

Or perhaps an explicit declaration of cache reservation should exist in the podspec, in lieu of assuming "inactive -> not important to reserve".

Another possibility I've not explored is cgroup soft limits, and/or a more heuristic based detection of memory pressure.

Contrary Interpretations of "Inactive"

Kubernetes seems to have an implicit belief that the kernel is finding the working set, and keeping it in the active LRU. Everything not in the working state goes in the inactive LRU and is reclaimable.

A quote from the documentation [emphasis added]:

The value for memory.available is derived from the cgroupfs instead of tools like free -m. This is important because free -m does not work in a container, and if users use the node allocatable feature, out of resource decisions are made local to the end user Pod part of the cgroup hierarchy as well as the root node. This script reproduces the same set of steps that the kubelet performs to calculate memory.available. The kubelet excludes inactive_file (i.e. # of bytes of file-backed memory on inactive LRU list) from its calculation as it assumes that memory is reclaimable under pressure.

Compare a comment from mm/workingset.c in the Linux kernel:

All that is known about the active list is that the pages have been accessed more than once in the past. This means that at any given time there is actually a good chance that pages on the active list are no longer in active use.

While both Kubernetes and Linux agree that the working set is in the active list, they disagree about where memory in excess of the working set goes. I'll show that Linux actually wants to minimize the size of the inactive list, putting all extra memory in the active list, as long as there's a process using the file cache enough for it to matter (which may not be the case if the workload on a node consists entirely of stateless web servers, for example).

The Dilemma

One running an IO workload on Kubernetes must:

set a memory limit equal to or less than the memory request for any pod that utilizes the file LRU list, or
accept that any IO workload will eventually exceed its memory request through normal and healthy utilization of the file page cache.

The Kernel Implementation

Note I'm not a kernel expert. These observations are based on my cursory study of the code.

When a page is first loaded, add_to_page_cache_lru() is called. Normally this adds the page to the inactive list, unless this is a "refault". More on that later.

Subsequent accesses to a page call mark_page_accessed() within mm/swap.c. If the page was on the inactive list it's moved to the active list, incrementing pgactivate in /proc/vmstat. Unfortunately this counter does not distinguish between the anonymous and file LRUs, but examining pgactivate in conjunction with nr_inactive_file and nr_active_file gives a clear enough picture. These same counters are available within memory.stat for cgroups as well.

Accessing a page twice is all that's required to get on the active list. If the inactive list is too big, there may not be enough room in the active list to contain the working set. If the inactive list is too small, pages may be pushed off the tail before they've had a chance to move to the active list, even if they are part of the working set.

mm/workingset.c deals with this balance. It forms estimates from the inactive file LRU list stats and maintains "shadow entries" for pages recently evicted from the inactive list. When add_to_page_cache_lru() is adding a page and it sees a shadow entry for that page it calls workingset_refault() and workingset_refault is incremented in /proc/vmstat. If that returns true then the page is promoted directly to the active list and workingset_activate in /proc/vmstat is incremented. It appears this code path does not increment pgactivate.

So a page accessed twice gets it added to the active list. What puts downward pressure on the active list?

During scans (normally by kswapd, but directly by an allocation if there are insufficient free pages), inactive_list_is_low() may return true. If it does, shrink_active_list() is called.

The comments to inactive_list_is_low() are insightful:

 * The inactive anon list should be small enough that the VM never has
 * to do too much work.
 *
 * The inactive file list should be small enough to leave most memory
 * to the established workingset on the scan-resistant active list,
 * but large enough to avoid thrashing the aggregate readahead window.
 *
 * Both inactive lists should also be large enough that each inactive
 * page has a chance to be referenced again before it is reclaimed.
 *
 * If that fails and refaulting is observed, the inactive list grows.
 *
 * The inactive_ratio is the target ratio of ACTIVE to INACTIVE pages
 * on this LRU, maintained by the pageout code. An inactive_ratio
 * of 3 means 3:1 or 25% of the pages are kept on the inactive list.
 *
 * total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB

So, the presence of refaults (meaning, pages are faulted, pushed off the inactive list, then faulted again) indicates the inactive list is too small, which means the active list is too big. If refaults aren't happening then the ratio of active:inactive is capped by a formula based on the total size of inactive + active. A larger cache favors a larger active list in proportion to the inactive list.

I believe (through I've not confirmed with experiment) that the presence of a large number of refaults could also mean there simply isn't enough memory available to contain the working set. The refaults will cause the inactive list to grow and the active list to shrink, causing Kubernetes to think there is less memory pressure, the opposite of reality!

calvix · 2018-06-13T16:20:15Z

+1 👍

fejta-bot · 2018-09-11T16:44:42Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

wenjianhn · 2024-04-28T02:00:14Z

For OOMs that's related to dirty pages, try setting memory.high as 99% of memory.limit_in_bytes or memory.max as a workaround.
e.g.,

[a@container ~]# cat /sys/fs/cgroup/memory/memory.limit_in_bytes 
2147483648
[a@container ~]# cat /sys/fs/cgroup/memory/memory.high 
2126008320

Note that we are using a kernel that has backported memory.high to cgroup v1.

mborne · 2024-07-12T10:35:36Z

We have tried @wenjianhn 's workaround setting memory.high = 0.9 * memory.max on some Pods with good results (at 99%, it was to late to start cleaning cache)

We also noticed that Kubernetes 1.27: Quality-of-Service for Memory Resources (alpha) follows the same approach by improving the handling of memory.max, memory.min and memory.high.

rptaylor · 2024-07-25T01:33:25Z

Trying to summarize information on this for my own understanding... Any clarifications appreciated!

As far as I can tell, although it seems cgroup v2 alone may not fully address this, the memory QoS feature KEP 2570 introduced in k8s 1.22, which uses new features of cgroup v2, is meant to address this.
This feature was further improved in k8s 1.27.

A comment in the PR for that KEP says:

A concrete use case is #43916. The summary of that longstanding
issue is that some apps read many files into memory, filling the page cache. Eventually, this may result in cgroup OOM
even though the memory could be reclaimed. Some users have written workarounds like agents that proactively force
reclaim https://github.com/linchpiner/cgroup-memory-manager. The goal is that with this feature, this will not be
necessary, as memory.high (a bit before the limit) would be reached, and the memory would be reclaimed automatically.

However based on a careful reading of the changes in k8s 1.27, a Guaranteed pod (req=lim) no longer results in setting memory.high. So the workaround described earlier in this thread using the Guaranteed QoS would no longer have the intended effect if Memory QoS is enabled. More specifically, this shows that in k8s >= 1.27 with memory QoS enabled, pods should be Burstable or Best Effort in order for memory.high to be set. From what I understood, you need to have memory.high to trigger the PSI (Pressure Stall) mechanism to try to reclaim some memory from page caches.

The memory QoS feature gate remains alpha (off by default) in k8s 1.30.

tschuchortdev · 2024-10-31T22:24:04Z

My understanding is as follows:

This is issue is exclusively about EVICTION. Lots of people in this thread talk about OOM kills and apparently don't understand the difference, despite having been reminded multiple times. Hence the general confusion when people are talking about completely separate issues.
Page cache will never grow beyond the pod-spec memory.limit, both in cgroup v1 and v2, because page cache is always reclaimable and the Linux kernel will start to reclaim when the cgroup maximum limit is reached. Therefore setting request == limit should "fix" the issue in both cgroup v1 and v2, since the scheduler guarantees enough memory.available on the node for the sum of all container-spec memory.request
Using a memory.limit in the pod-spec that is somewhat higher than the memory.request will at least mitigate the issue by stopping "ill-behaved" applications from growing their page cache unboundedly. This does not "fix" the issue in every case, but should be good enough for most cases. The important thing is to have at least some reasonable memory.limit on every container.
Using hacky scripts to manually reclaim all page cache periodically also "fixes" the issue by preventing unbounded growth of page cache.
Using direct I/O also "fixes" the issue by simply not using page cache
Migrating to cgroup v2 alone will not change anything. While cgroup v2 offers new knobs to deal with page cache, you have to use them
Cgroup v2 offers the new memory.high setting which is a soft limit on memory. When the soft limit is reached, the kernel will try to reclaim pages but not trigger an OOM kill if it can not reclaim enough to go back under the soft limit. To utilize the memory.high setting in Kubernetes you have to use the new memory QoS feature. However, since we are dealing with node pressure evictions here and not OOM, I don't see how the memory.high setting does anything we couldn't do before. It will just cap your page cache usage at 0.8 * memory.limit instead of 1.0 * memory.limit and that's it.

However based on a careful reading of the changes in k8s 1.27, a Guaranteed pod (req=lim) no longer results in setting memory.high. I think this means the workaround described earlier in this thread using the Guaranteed QoS would no longer have the intended effect if Memory QoS is enabled. From what I understood, you need to have memory.high to trigger the PSI (Pressure Stall) mechanism to try to reclaim some memory from page caches.

PSI is an observability feature that merely measures the delays caused by reclaim. In my opinion it has no relevance here. The kernel's reclaim algorithm will trigger when cgroup v1's memory.limit or cgroup v2's memory.high or memory.max is reached. The only difference here between memory.high and memory.max is that memory.high will trigger earlier and not cause an OOM if the reclaim fails, but in both cases it will reclaim all page cache before failing.

TLDR: Slightly overcommit node memory available for scheduling. Use sensible memory.request and memory.limit values for all containers. If your application needs page cache to function, make room for it in your memory.request. If your nodes have swap enabled then cgroup v2's memory.high (i.e. Kubernetes memory QoS feature) together with PSI metrics can gracefully degrade your workload giving you more time to increase memory.request before an OOM kill but does nothing new to solve the node pressure eviction issue that we're talking about here.

emrahbecer · 2024-11-22T20:06:29Z

My understanding is as follows:

This is issue is exclusively about EVICTION. Lots of people in this thread talk about OOM kills and apparently don't understand the difference, despite having been reminded multiple times. Hence the general confusion when people are talking about completely separate issues.

Page cache will never grow beyond the pod-spec memory.limit, both in cgroup v1 and v2, because page cache is always reclaimable and the Linux kernel will start to reclaim when the cgroup maximum limit is reached. Therefore setting request == limit should "fix" the issue in both cgroup v1 and v2, since the scheduler guarantees enough memory.available on the node for the sum of all container-spec memory.request

Using a memory.limit in the pod-spec that is somewhat higher than the memory.request will at least mitigate the issue by stopping "ill-behaved" applications from growing their page cache unboundedly. This does not "fix" the issue in every case, but should be good enough for most cases. The important thing is to have at least some reasonable memory.limit on every container.

Using hacky scripts to manually reclaim all page cache periodically also "fixes" the issue by preventing unbounded growth of page cache.

Using direct I/O also "fixes" the issue by simply not using page cache

Migrating to cgroup v2 alone will not change anything. While cgroup v2 offers new knobs to deal with page cache, you have to use them

Cgroup v2 offers the new memory.high setting which is a soft limit on memory. When the soft limit is reached, the kernel will try to reclaim pages but not trigger an OOM kill if it can not reclaim enough to go back under the soft limit. To utilize the memory.high setting in Kubernetes you have to use the new memory QoS feature. However, since we are dealing with node pressure evictions here and not OOM, I don't see how the memory.high setting does anything we couldn't do before. It will just cap your page cache usage at 0.8 * memory.limit instead of 1.0 * memory.limit and that's it.

However based on a careful reading of the changes in k8s 1.27, a Guaranteed pod (req=lim) no longer results in setting memory.high. I think this means the workaround described earlier in this thread using the Guaranteed QoS would no longer have the intended effect if Memory QoS is enabled. From what I understood, you need to have memory.high to trigger the PSI (Pressure Stall) mechanism to try to reclaim some memory from page caches.

PSI is an observability feature that merely measures the delays caused by reclaim. In my opinion it has no relevance here. The kernel's reclaim algorithm will trigger when cgroup v1's memory.limit or cgroup v2's memory.high or memory.max is reached. The only difference here between memory.high and memory.max is that memory.high will trigger earlier and not cause an OOM if the reclaim fails, but in both cases it will reclaim all page cache before failing.

TLDR: Slightly overcommit node memory available for scheduling. Use sensible memory.request and memory.limit values for all containers. If your application needs page cache to function, make room for it in your memory.request. If your nodes have swap enabled then cgroup v2's memory.high (i.e. Kubernetes memory QoS feature) together with PSI metrics can gracefully degrade your workload giving you more time to increase memory.request before an OOM kill but does nothing new to solve the node pressure eviction issue that we're talking about here.

Are you sure the page cache (active files) won't cause OOM kills? When container_memory_working_set_bytes metric reaches the container memory limit, the container is expected to be OOM killed. This metric takes account into active page caches. That's why some people here are desperately trying to drop caches for IO bound pods. Setting request=limit may help in node evictions but won't help for OOM kills.

tschuchortdev · 2024-11-25T21:52:00Z

When container_memory_working_set_bytes metric reaches the container memory limit, the container is expected to be OOM killed.

container_memory_working_set_bytes is a cAdvisor metric, it has nothing to do with the Linux OOM killer. We must rid ourselves of this Kubernetes view point and look at what the kernel does.

Clearly, inactive pages will always be reclaimed before an OOM kill. The kernel tries to keep a balance between active and inactive lists and adjusts them periodically, moving pages back from active to inactive. It should be noted here that the kernel is rather conservative and keeps pages in active list that haven't been accessed for some time if there is no pressure to balance the lists, so the active list can be larger than the actual working set. It is my understanding that a page on the active list has to be unreferenced for two consecutive "cycles" to move back to the inactive list via asynchronous reclaim. An OOM kill happens when a page fault can not be satisfied from the kernel's list of free pages and the synchronous direct reclaim fails to evict enough pages. The question is then, does the direct reclaim evict pages directly from the active list or at least rebalance it so that they can be evicted from the inactive list in the next scan?

The entry point for direct reclaim is do_try_to_free_pages which calls shrink_zones, which further calls mem_cgroup_soft_limit_reclaim and shrink_node. Further following those two code paths shows that they both eventually end up at shrink_node_memcg. In this function, the for_each_evictable_lru loop is our first hint that active file lru is considered evictable and will indeed be scanned for page reclaim. At the end, shrink_active_list function is called, which appears to move pages from active lru back to inactive lru, where they should be reclaimed in the next iteration of shrink_zones (the reclaim algorithm is repeatedly executed with increasing priority/pressure when the previous iteration failed to reclaim enough pages).

Here's an older blog post from 2015 that also shows how shrink_active_list is called in the direct reclaim code path for active file lru:
https://chengyihe.wordpress.com/2015/12/04/kernel-mm-shrink_lruvec/
https://chengyihe.wordpress.com/2015/12/05/kernel-mm-shrink_active_list/

And here is StackOverflow post where a user creates a Kernel patch specifically to avoid reclaim of active page cache.

While I'm far from an expert on the Linux reclaim algorithm, given this evidence I do believe that active file pages will generally be reclaimed before an OOM kill is triggered, absent new information.

I can imagine a few reasons why an OOM may occur while there is still active file usage reported: wrong NUMA zone, take too long to write back (file system locks preventing writeout), memory fragmentation, slow metrics collection, etc.

That's why some people here are desperately trying to drop caches for IO bound pods

In any case, I don't see what goal this is supposed to achieve. If this page cache is on the inactive lru, then it will surely be reclaimed before an OOM, I think everyone can agree on that. If it is on the active lru and part of the true working set, then dropping it manually will do nothing as it will be faulted in again almost immediately. If this page cache is "active" but not truly used, then the kernel will eventually move it back to inactive when balancing the lrus, where again it can be reclaimed.

emrahbecer · 2024-11-26T09:10:24Z

When container_memory_working_set_bytes metric reaches the container memory limit, the container is expected to be OOM killed.

container_memory_working_set_bytes is a cAdvisor metric, it has nothing to do with the Linux OOM killer. We must rid ourselves of this Kubernetes view point and look at what the kernel does.

Clearly, inactive pages will always be reclaimed before an OOM kill. The kernel tries to keep a balance between active and inactive lists and adjusts them periodically, moving pages back from active to inactive. It should be noted here that the kernel is rather conservative and keeps pages in active list that haven't been accessed for some time if there is no pressure to balance the lists, so the active list can be larger than the actual working set. It is my understanding that a page on the active list has to be unreferenced for two consecutive "cycles" to move back to the inactive list via asynchronous reclaim. An OOM kill happens when a page fault can not be satisfied from the kernel's list of free pages and the synchronous direct reclaim fails to evict enough pages. The question is then, does the direct reclaim evict pages directly from the active list or at least rebalance it so that they can be evicted from the inactive list in the next scan?

The entry point for direct reclaim is do_try_to_free_pages which calls shrink_zones, which further calls mem_cgroup_soft_limit_reclaim and shrink_node. Further following those two code paths shows that they both eventually end up at shrink_node_memcg. In this function, the for_each_evictable_lru loop is our first hint that active file lru is considered evictable and will indeed be scanned for page reclaim. At the end, shrink_active_list function is called, which appears to move pages from active lru back to inactive lru, where they should be reclaimed in the next iteration of shrink_zones (the reclaim algorithm is repeatedly executed with increasing priority/pressure when the previous iteration failed to reclaim enough pages).

Here's an older blog post from 2015 that also shows how shrink_active_list is called in the direct reclaim code path for active file lru: https://chengyihe.wordpress.com/2015/12/04/kernel-mm-shrink_lruvec/ https://chengyihe.wordpress.com/2015/12/05/kernel-mm-shrink_active_list/

And here is StackOverflow post where a user creates a Kernel patch specifically to avoid reclaim of active page cache.

While I'm far from an expert on the Linux reclaim algorithm, given this evidence I do believe that active file pages will generally be reclaimed before an OOM kill is triggered, absent new information.

I can imagine a few reasons why an OOM may occur while there is still active file usage reported: wrong NUMA zone, take too long to write back (file system locks preventing writeout), memory fragmentation, slow metrics collection, etc.

That's why some people here are desperately trying to drop caches for IO bound pods

In any case, I don't see what goal this is supposed to achieve. If this page cache is on the inactive lru, then it will surely be reclaimed before an OOM, I think everyone can agree on that. If it is on the active lru and part of the true working set, then dropping it manually will do nothing as it will be faulted in again almost immediately. If this page cache is "active" but not truly used, then the kernel will eventually move it back to inactive when balancing the lrus, where again it can be reclaimed.

Thanks for the detailed info.
I agree the Linux kernel has a built-in OOM kill feature but are you sure the Kubernetes doesn't have another OOM kill feature on top of Linux kernel?
From the documentation, it says the kubelet on nodes will monitor the cadvisor metric "container_memory_working_set_bytes" and if this metric reaches the container's limit, the kubelet will terminate the container (OOM Kill). So according to documentation Kubernetes can definitely trigger an OOM kill. I didn't read the source code of Kubernetes though to confirm the documentation.

tschuchortdev · 2024-11-26T14:41:29Z

Can you link to that documentation page? I'm not aware of any Kubernetes-controlled OOM tool (besides pod evictions). As far as I know, actual OOM kills are handled transparently and exclusively by the operating system (through cgroups in linux). My guess is that they either mean a pod eviction (which directly tracks cAdvisor's container_memory_working_set_bytes iirc) and not a process OOM kill or perhaps they simply forgot about the influence of page cache on container_memory_working_set_bytes since very few workloads actually use enough of it to be relevant.

Anyone who is interested in Kubernetes OOMs should definitely read this excellent blog post. There is a flow chart at the end showing all the different OOM scenarios. (But don't forget that the blog post is referring to outdated cgroups v1 in many places).

When pod evictions are the problem then drop_caches can actually help you because the kubelet is unaware that there is reclaimable page cache and will evict pods prematurely. Of course, the better solution would be to use the cgroup mechanisms to put memory pressure on processes and force them to reclaim page cache before node allocatable memory shrinks below the eviction threshold. Setting kubelet config enforceNodeAllocatable: [pods] and putting a memory.high cgroup limit on the pods.slice cgroup should do it.

emrahbecer · 2024-11-26T15:31:46Z

Can you link to that documentation page? I'm not aware of any Kubernetes-controlled OOM tool (besides pod evictions). As far as I know, actual OOM kills are handled transparently and exclusively by the operating system (through cgroups in linux). My guess is that they either mean a pod eviction (which directly tracks cAdvisor's container_memory_working_set_bytes iirc) and not a process OOM kill or perhaps they simply forgot about the influence of page cache on container_memory_working_set_bytes since very few workloads actually use enough of it to be relevant.

Anyone who is interested in Kubernetes OOMs should definitely read this excellent blog post. There is a flow chart at the end showing all the different OOM scenarios. (But don't forget that the blog post is referring to outdated cgroups v1 in many places).

When pod evictions are the problem then drop_caches can actually help you because the kubelet is unaware that there is reclaimable page cache and will evict pods prematurely. Of course, the better solution would be to use the cgroup mechanisms to put memory pressure on processes and force them to reclaim page cache before node allocatable memory shrinks below the eviction threshold. Setting kubelet config enforceNodeAllocatable: [pods] and putting a memory.high cgroup limit on the pods.slice cgroup should do it.

This is the official documentation of an oom killed pod https://kubernetes.io/docs/tasks/configure-pod-container/assign-memory-resource/#exceed-a-container-s-memory-limit

It says a pod becomes a candidate for termination and killed if it reaches the memory limit. However it doesn't depict how a container is evaluated to have reached its limit (which metric?) or who actually kills it (kernel or kubelet?)

When I google "container_memory_working_set_bytes and oom killer" I see non-official articles saying that the container_memory_working_set_bytes is monitored and if exceeded the container would be killed by kubelet. May be this is one of those internet myths or misinformation spread.

Wish the official documents were more clear.
I'll read the article you've shared. Seems a good source.

dashpole · 2024-11-26T16:01:32Z

+1 to #43916 (comment). Setting memory.high to be slightly lower than the eviction threshold should help ensure active pages that can be reclaimed are reclaimed before eviction kicks in, and solve this problem. It would need to be done on the allocatable cgroup.

cc @dchen1107

SuperQ · 2024-11-26T19:57:23Z

When I google "container_memory_working_set_bytes and oom killer" I see non-official articles saying that the container_memory_working_set_bytes is monitored and if exceeded the container would be killed by kubelet. May be this is one of those internet myths or misinformation spread.

To my knowledge, the contents of these posts propagate myths.

I basically agree with what @tschuchortdev has been saying in this thread.

In my experience, container_memory_working_set_bytes is a highly misleading metric and has been causing a large amount of confusion, myth, and miss-management of memory controls for a very long time now.

The really difficult thing here is that memory accounting is very complicated. More complicated than one number can surface to users.

GitBenjamin · 2024-12-13T06:33:00Z

In any case, I don't see what goal this is supposed to achieve. If this page cache is on the inactive lru, then it will surely be reclaimed before an OOM, I think everyone can agree on that. If it is on the active lru and part of the true working set, then dropping it manually will do nothing as it will be faulted in again almost immediately. If this page cache is "active" but not truly used, then the kernel will eventually move it back to inactive when balancing the lrus, where again it can be reclaimed.

@tschuchortdev
Hi, tschuchortdev, your comments are excellent.
I see that the official document about active_file memory is not considered as available memory.
There is ambiguity between official description and the conclusion of you. You think the page cache is "active" but not truly used, then the kernel will reclaimed it. Opposite, official think the kubelet treats "active_file memory" areas as not reclaimable.
I quite agree with you, but i hope there is more evidence to support it, can you provide me with some help?

tschuchortdev · 2024-12-13T09:12:29Z

@GitBenjamin

You think the page cache is "active" but not truly used, then the kernel will reclaimed it. Opposite, official think the kubelet treats "active_file memory" areas as not reclaimable.

Both can be true. It is my belief (according to above investigation) that active_file pages will always be reclaimed before an OOM. It is also known that the kubelet adds up all container_memory_working_set_bytes metrics to get the node memory usage, which includes active_file usage (and slab_reclaimable too), so the Kubelet will "prematurely" trigger a pod eviction even though much of the memory usage it is counting could be reclaimed by the kernel when needed.

i hope there is more evidence to support it, can you provide me with some help?

What sort of help are you looking for? In the blog post series I linked earlier there should be a program that reserves page cache, if I remember correctly. You could use that to test the behaviour of OOM killer and kubelet pod evictions experimentally.

GitBenjamin · 2024-12-13T09:32:56Z

Thanks, i will read the blog carefully, and do some tests. Your answer is help!

so the Kubelet will "prematurely" trigger a pod eviction even though much of the memory usage it is counting could be reclaimed by the kernel when needed.

By the way, is this your guess around the official verdict, or is it a proven conclusion?

Is it possible for us to send kubelet to reclaim active_file in advance, without reducing the amount of memory.available to pod?

tschuchortdev · 2024-12-13T11:13:21Z

I think the fact that container_memory_working_set_bytes is the relevant metric for kubelet is official documentation. The problem with active_page cache and premature pod evictions is a conclusion from that.

Is it possible for us to send kubelet to reclaim active_file in advance, without reducing the amount of memory.available to pod?

active_file can only be reclaimed by the kernel, not by kubelet. Cgroups can be used to make the kernel reclaim memory. You could set memory request == limit in the kubernetes manifest for all containers (which may perhaps be inefficient) or you could put a memory.high cgroup setting on the ancestor cgroup of all pods, as outlined earlier (I don't think there's a way to do this through Kubernetes. You'd have to do it on the host manually). The latter would still allow all containers to use as much page cache as they want, as long as there is no overall memory contention, as well as allow them to go over the threshold temporarily until the kubelet acts with a (deserved) pod eviction.

derekwaynecarr assigned derekwaynecarr and sjenning Mar 31, 2017

derekwaynecarr added the sig/node label Mar 31, 2017

vdavidoff mentioned this issue Mar 31, 2017

kubernetes 1.3, systemd 229, memory working_set is radically different from free google/cadvisor#1529

Open

k8s-ci-robot added the lifecycle/stale label Dec 22, 2017

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Jan 21, 2018

k8s-ci-robot removed the lifecycle/rotten label Feb 16, 2018

mstemm mentioned this issue Mar 29, 2018

File-related output memory growth on k8s falcosecurity/falco#338

Closed

r7vme mentioned this issue Jun 13, 2018

Set eviction threshold for memory giantswarm/k8scloudconfig#390

Merged

This was referenced Jul 26, 2024

reuse node id when registering services owncloud/ocis#9656

Merged

[oCIS] Scaling oCIS in kubernetes causes requests to fail Timebox 8PD owncloud/ocis#8589

Open

craig-seeman mentioned this issue Aug 2, 2024

Memory utilization question - possible leak / gc issue? spegel-org/spegel#546

Closed

cdhowie mentioned this issue Aug 14, 2024

Memory leak with simple shuffle playout savonet/liquidsoap#4058

Closed

AndyTitu mentioned this issue Sep 27, 2024

High memory usage 1Password/onepassword-operator#149

Closed

howardjohn mentioned this issue Oct 9, 2024

CNI highly skewed by pagecaches istio/istio#53493

Closed

james-munson mentioned this issue Oct 11, 2024

[BUG][v1.7.x] Excessive memory consumption caused by RWX volumes / ganesha.nfsd longhorn/longhorn#8523

Closed

This was referenced Oct 16, 2024

Constant memory growth with the file backend nats-io/nats-server#6010

Closed

AWS Fargate Memory slowly increasing/leak over time nats-io/nats-server#6006

Closed

katexochen mentioned this issue Oct 23, 2024

node-installer: remove resource limits edgelesssys/contrast#948

Merged

wkloucek mentioned this issue Oct 23, 2024

storage-users high memory usage when S3ng multipart uploads are enabled owncloud/ocis#10398

Open

AntonioMaccarini mentioned this issue Nov 19, 2024

[Bug]: Memory usage of PostgreSQL is growing constantly? cloudnative-pg/cloudnative-pg#6111

Closed

4 tasks

pkoutsovasilis mentioned this issue Dec 9, 2024

Python and Go init/sidecar containers being OOMKilled with default settings open-telemetry/opentelemetry-operator#3479

Open

phillebaba mentioned this issue Feb 20, 2025

Fine-Grained Resource Allocation for Spegel DaemonSets spegel-org/spegel#718

Closed

atzoum mentioned this issue Mar 24, 2025

feat: include rss memory in calculated mem stats rudderlabs/rudder-go-kit#760

Merged

1 task

kubelet counts active page cache against memory.available (maybe it shouldn't?) #43916

kubelet counts active page cache against memory.available (maybe it shouldn't?) #43916

Comments

vdavidoff commented Mar 31, 2017

vdavidoff commented Mar 31, 2017 • edited Loading

vdavidoff commented Apr 1, 2017

sjenning commented Apr 3, 2017

sjenning commented Apr 3, 2017

fejta-bot commented Dec 22, 2017

fejta-bot commented Jan 21, 2018

berlincount commented Feb 16, 2018

mgomezch commented Feb 16, 2018

berlincount commented Feb 20, 2018

devopsprosiva commented Mar 1, 2018

berlincount commented Mar 16, 2018

treacher commented Mar 21, 2018

berlincount commented Mar 21, 2018

devopsprosiva commented Mar 21, 2018

thefirstofthe300 commented Mar 30, 2018

thefirstofthe300 commented Mar 30, 2018

berlincount commented Apr 12, 2018

bitglue commented May 30, 2018

Contrary Interpretations of "Inactive"

The Dilemma

The Kernel Implementation

calvix commented Jun 13, 2018 • edited Loading

fejta-bot commented Sep 11, 2018

wenjianhn commented Apr 28, 2024

mborne commented Jul 12, 2024

rptaylor commented Jul 25, 2024 • edited Loading

tschuchortdev commented Oct 31, 2024

emrahbecer commented Nov 22, 2024 • edited Loading

tschuchortdev commented Nov 25, 2024

emrahbecer commented Nov 26, 2024

tschuchortdev commented Nov 26, 2024 • edited Loading

emrahbecer commented Nov 26, 2024

dashpole commented Nov 26, 2024

SuperQ commented Nov 26, 2024

GitBenjamin commented Dec 13, 2024

tschuchortdev commented Dec 13, 2024

GitBenjamin commented Dec 13, 2024

tschuchortdev commented Dec 13, 2024

vdavidoff commented Mar 31, 2017 •

edited

Loading

calvix commented Jun 13, 2018 •

edited

Loading

rptaylor commented Jul 25, 2024 •

edited

Loading

emrahbecer commented Nov 22, 2024 •

edited

Loading

tschuchortdev commented Nov 26, 2024 •

edited

Loading