
kubelet counts active page cache against memory.available (maybe it shouldn't?) #43916

Open · vdavidoff opened this issue Mar 31, 2017 · 125 comments
Labels: kind/feature · lifecycle/frozen · sig/node

@vdavidoff

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
active_file
inactive_file
working_set
WorkingSet
cAdvisor
memory.available


Is this a BUG REPORT or FEATURE REQUEST? (choose one):
We'll say BUG REPORT (though this is arguable)

Kubernetes version (use kubectl version):
1.5.3

Environment:

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release):
    NAME="Ubuntu"
    VERSION="14.04.5 LTS, Trusty Tahr"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 14.04.5 LTS"
    VERSION_ID="14.04"

  • Kernel (e.g. uname -a):
    Linux HOSTNAME_REDACTED 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:

  • Others:

What happened:
A pod was evicted due to memory pressure on the node, when it appeared to me that there shouldn't have been sufficient memory pressure to cause an eviction. Further digging seems to have revealed that active page cache is being counted against memory.available.

What you expected to happen:
memory.available would not have active page cache counted against it, since it is reclaimable by the kernel. This also seems to greatly complicate a general case for configuring memory eviction policies, since in a general sense it's effectively impossible to understand how much page cache will be active at any given time on any given node, or how long it will stay active (in relation to eviction grace periods).

How to reproduce it (as minimally and precisely as possible):
Cause a node to chew up enough active page cache that the existing calculation for memory.available trips a memory eviction threshold, even though the threshold would not be tripped if the page cache - active and inactive - were freed for anon memory.

Anything else we need to know:
I discussed this with @derekwaynecarr in #sig-node and am opening this issue at his request (conversation starts here).

Before poking around on Slack or opening this issue, I did my best to read through the 1.5.3 release code, Kubernetes documentation, and cgroup kernel documentation to make sure I understood what was going on here. The short of it is that I believe this calculation:

memory.available := node.status.capacity[memory] - node.stats.memory.workingSet

Is using cAdvisor's value for working set, which if I traced the code correctly, amounts to:

$cgroupfs/memory.usage_in_bytes - total_inactive_file

Where, according to my interpretation of the kernel documentation, usage_in_bytes includes all page cache:

$kernel/Documentation/cgroups/memory.txt

 
The core of the design is a counter called the res_counter. The res_counter
tracks the current memory usage and limit of the group of processes associated
with the controller.
 
...
 
2.2.1 Accounting details
 
All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
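
For reference, here is a rough shell sketch of that calculation as I understand it (cgroup v1 paths; capacity is taken from /proc/meminfo, so treat this as an approximation of what kubelet computes rather than its exact code path):

#!/bin/bash
# Approximate kubelet's memory.available on a cgroup v1 node:
#   memory.available := capacity - workingSet
#   workingSet       := memory.usage_in_bytes - total_inactive_file
memory_capacity_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
memory_capacity=$((memory_capacity_kb * 1024))
memory_usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
inactive_file=$(awk '/^total_inactive_file / {print $2}' /sys/fs/cgroup/memory/memory.stat)
working_set=$((memory_usage - inactive_file))
memory_available=$((memory_capacity - working_set))
echo "capacity:         $((memory_capacity / 1024)) kB"
echo "working set:      $((working_set / 1024)) kB"
echo "memory.available: $((memory_available / 1024)) kB"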

Ultimately my issue concerns how I can set generally applicable memory eviction thresholds if active page cache is counted against them, when there's no way to know (1) generally how much page cache will be active across a cluster's nodes, to use as part of general threshold calculations, or (2) how long active page cache will stay active, to use as part of eviction grace period calculations.

I understand that there are many layers here and that this is not a particularly simple problem to solve generally correctly, or even understand top to bottom. So I apologize up front if any of my conclusions are incorrect or I'm missing anything major, and I appreciate any feedback you all can provide.

As requested by @derekwaynecarr: cc @sjenning @derekwaynecarr

@derekwaynecarr derekwaynecarr added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Mar 31, 2017
@vdavidoff
Author

vdavidoff commented Mar 31, 2017

I'm trying to better understand how the kernel deals with the page cache in terms of active and inactive pages, and I may have just discovered that it actually does not reclaim active page cache (for example, if you echo 3 to drop_caches), or at least doesn't know what to do with it if there is no swap available (and as recommended by the Kubernetes documentation, my nodes have swap disabled). So maybe I'm just totally wrong here, and I need to better understand specifically how the system considers page cache entries active, and work back from there.

...well, with another drop_caches test on a machine with 1.7GB Active(file) reporting, a substantial amount did get dumped, dropping Active(file) to ~134MB, so maybe I'm still onto something here.

@vdavidoff
Author

At this point in my research I'm wondering if drop_caches releases active page cache because it actually first moves pages to the inactive_list, then evicts from the inactive_list. And if something like that is happening, then maybe it's not possible to determine what from the active_list could be dropped without iterating over it, which is not something cAdvisor or kubelet would do. I guess I was hoping there'd be some stats exposed somewhere that could be used as a heuristic to determine, with some reasonable approximation, what could be dropped without having to do anything else, but maybe that just doesn't exist. If that's the case, then I wonder how it's possible to use memory eviction policies effectively.

At this point I'm sufficiently dizzy from reading various source code and documents, and I'm just going to shut up now.

@sjenning
Contributor

sjenning commented Apr 3, 2017

@vdavidoff the memory management is convoluted, for sure.

To start, I would agree that subtracting the active pages from available isn't a great heuristic. It is very pessimistic about reclaimable memory and how much of the active list could be reclaimed without pushing the system into a thrashing state.

Doing a sync; echo 1 > /proc/sys/vm/drop_caches will free as much page cache as possible. Keep in mind that this doesn't drop dirty or locked pages from the cache, hence the sync before the drop to maximize the cleanness of the cache and maximize reclaim.

In order to get a good value for available, which is "capacity - how much the system needs in order to run without thrashing, as indicated by high major fault rates", one would need to do something like

sync
echo 1 > /proc/sys/vm/drop_caches
sleep 10  # let processes fault their active working set back in
# then apply the calculation we currently use; active_pages will now be closer to "memory required to avoid thrashing"

This is not workable however, because it trashes the page cache system wide and would cause periodic performance degradation.

So the question is "how can we determine the minimum amount of memory needed to support the currently running workload without thrashing?" Not an easy question to answer.
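
(For what it's worth, one crude way to watch for that "high major fault rate" signal is to sample pgmajfault from /proc/vmstat; this is purely illustrative and the threshold of interest is workload-dependent:)

# Sample the system-wide major fault rate once per second; a sustained
# spike suggests the working set no longer fits and reclaim is thrashing.
prev=$(awk '/^pgmajfault / {print $2}' /proc/vmstat)
while sleep 1; do
  cur=$(awk '/^pgmajfault / {print $2}' /proc/vmstat)
  echo "major faults/s: $((cur - prev))"
  prev=$cur
done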

@sjenning
Contributor

sjenning commented Apr 3, 2017

And to answer your question about how exactly drop_caches works, basically like this

for each superblock
  for each cached inode
    invalidate all page mappings to the inode and release the pages

It actually works backward from the filesystem to find the pages that can be freed. It doesn't consider the LRU (i.e. the active/inactive page lists).

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 21, 2018
@berlincount

/remove-lifecycle rotten

It seems like we're affected by this problem as well. With tightly packed containers, long-running jobs involving heavy disk I/O sporadically fail.

Take this example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: democlaim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ssd
  resources:
    requests:
      storage: 1.2Ti
---
apiVersion: batch/v1
kind: Job
metadata:
  name: demo
spec:
  template:
    spec:
      containers:
      - name: demo
        image: ubuntu
        command: ["bash",  "-c", "apt-get update ; apt-get install -y wget ; wget -O /data/zstd.deb https://packages.shopify.io/shopify/public/packages/ubuntu/xenial/zstd_1.2.0-0shopify_amd64.deb/download ; wget -O /data/libzstd.deb https://packages.shopify.io/shopify/public/packages/ubuntu/xenial/libzstd1_1.2.0-0shopify_amd64.deb/download ; dpkg -i /data/libzstd.deb /data/zstd.deb ; echo 'KLUv/QRQtR8ApnmvQMAaBwCp6S2VEGAQoIMR3DIbNd4HvrRTZ9cQVwgYX19vUMlci2xnmLLgkNZaGsZmRkAEFuSmnbH8UpgxwUmkdx6yAJoAhwDu8W4cEEiofKDDBIa1pguh/vv4eVH7f7qHvH1N93OmnQ312X+6h8rb+nS0n/eh6s+rP5MZwQUC7cOaJEJuelbbWzpqfZ6advxPlOv6Ha8/D2jCPwQceFDCqIIDoAAASmhMkDoVisCA6fmpJd0HKRY7+s/P0QkkGjVYP2dNCGq1WHe1XK2WqxUkwdVCGetBQRRYLBbrNFTEjlTgMLEiZmLIRYgWT9MzTQ+Uo2AUoWhAWFQB7iFvo6YSZNHNSY5U9n92D5W3/6d7P2+jZv8DWFs0oHjNZLU27B4qb/9P93gfavaf7iEWSYETxQOO2GqrJfH2Nd1b3X66d14DQbo8veCxY7W1GR1/uP2ne8jb16d7ADEH3qhAALGBYQPxeek4lUJjBMlpJuuhC/H8R9Ltp3vnRc2n/+6hEm2jyhSYMUT1hcBqq935072ft1GfHajbf7qH523U59PP7qESLQIiSTJcec1k7eF/uvfzNuqz/7N7qLz9P1FkmMw4ovLCtNo6wDtvoz77T/dQeVuf7v3sIIR5qM/+0z1U3r4+He28D/X9p3t43tb8zw91++fBcKaIhddM1k3uIW9fn+7sHur2/3QPk8UFlROB1dYxtlHT/Wf3ULf/Z1ZoNs6IyBb85CSerVSKFAx41MB6w6/0P91D3r4+3Xkf6vbvkqHlgQOYnNdMVhOS7p23Nf+ze6jbf7p3BotKKQwMvPuxqwxtQZGGqcNqax0+2vGxUwUQqBInORbAsCc4/utsIMHIjtdMVpvD+red51xycsNhyg2m1VZb0r/NqyyLUo8lW8t5/jf62eehbv9P9/C8jfp0/9kZiGP0JkILoBIw7KSCLKhikCMAAQHAgoA+Yk8AAjQQEoAACAxBQFcEQGCAcCuCACAAiAYIqDACVyuMeO/lZnP49YuJifET/DqMhFOzZUJDc6W5kGD1OGhhORIUxs/EoaGhcmhoLdYShiQhNGm//E0IDUWEMnroe0JoaDQcaLcMp63yfIuKck/X8QoCDbQRklMBggIErDt1qfySehKEwet2c/0/MLRMEH5ZAxq+RlpgiN8BMOMwt+HwGvF3W2aM0KjIUT/Em+cFyAEQMGUIEjCG7YLmcKmhA6ySpQ7QIJao+Tr/Ygp+MGmXtAyBBdHa63eY+W9lcdCVFioqTUB7WITH0ZAfgx5TXMzXgcmge1Iy3CK3WCk0xRLDTbllx2Ar9yhMpUkwoEDYJnasQZrXT/4JjLxAaWX9iX77a1KsfrFu5j8fRZmwDg==' | base64 -d > /data/1TiB_of_zeroes.tar.zst.zst.zst ; echo -n 'Unpacking ' ; zstd -d -T4 < /data/1TiB_of_zeroes.tar.zst.zst.zst | zstd -d -T4 | zstd -d -T4 | tar -C /data -xvf - && echo successful. || echo failed."]
        securityContext:
          privileged: false
        volumeMounts:
          - name: data
            mountPath: /data
            readOnly: false
        resources:
          requests:
            cpu: "4"
            memory: "1Gi"
          limits:
            cpu: "4"
            memory: "1Gi"
      restartPolicy: Never
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: democlaim
  backoffLimit: 0

Test this with:

kubectl create namespace tarsplosion
kubectl --namespace tarsplosion create -f ./demo.yml
kubectl --namespace tarsplosion logs job/demo --follow

The latter command might take a moment to become available.

The job tries to unpack 1TiB of zeroes (triple-compressed with zstd) - and
apparently fails because of memory exhaustion by buffers filled by tar.

There seems to be a problem similar to the one described in
https://serverfault.com/questions/704443/tar-uses-too-much-memory-for-its-buffer-workaround

  • the job only fails sometimes, but then in a nasty fashion.

The zstd used is a vanilla 1.2.0 packaged for xenial - previous versions are
not multithreaded and have a slightly different file format.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 16, 2018
@mgomezch

We don't currently have a reliable reproducer for this, but we often hit this when restoring large PostgreSQL backups with pg_basebackup. A particularly horrible but effective hack to help the backup restore process complete is to exec into the pod and sync; echo 1 > drop_caches repeatedly as suggested above (it also helps to sigstop/cont the backup process while flushing the cache).
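
(For anyone curious, the hack amounts to something like the loop below; the PID and interval are placeholders, and writing to /proc/sys/vm/drop_caches needs a privileged context and affects the whole node's cache, not just the pod:)

# Periodically pause the restore process, flush dirty pages, drop clean
# page cache, then resume. RESTORE_PID and the sleep value are placeholders.
RESTORE_PID=12345
while true; do
  kill -STOP "$RESTORE_PID"
  sync
  echo 1 > /proc/sys/vm/drop_caches
  kill -CONT "$RESTORE_PID"
  sleep 60
done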

Is there a good way to fix this without a change to the kernel's implementation of cgroups, though? Should this perhaps rather be a kernel bug?

@berlincount

This test load of mine seems to be killed reliably:

[..]
Unpacking 1TiB_of_zeroes
tar: 1TiB_of_zeroes: Wrote only 2560 of 10240 bytes
$ kubectl --namespace tarsplosion get events
LAST SEEN   FIRST SEEN   COUNT     NAME                          KIND      SUBOBJECT               TYPE      REASON                 SOURCE                                            MESSAGE
12m         12m          1         demo-xmnr6.1514cb1745ab1ac3   Pod                               Normal    SandboxChanged         kubelet, gke-tier1-central-pool-3-3f529658-8wwx   Pod sandbox changed, it will be killed and re-created.
11m         11m          1         demo-xmnr6.1514cb1e485790c7   Pod       spec.containers{demo}   Normal    Killing                kubelet, gke-tier1-central-pool-3-3f529658-8wwx   Killing container with id docker://demo:Need to kill Pod
11m         11m          1         demo.1514cb1e8c67ab56         Job                               Warning   BackoffLimitExceeded   job-controller                                    Job has reach the specified backoff limit
$ kubectl --namespace tarsplosion get pods -a
NAME         READY     STATUS      RESTARTS   AGE
demo-xmnr6   0/1       OOMKilled   0          17h

other region:

[..]
Unpacking 1TiB_of_zeroes
tar: 1TiB_of_zeroes: Wrote only 512 of 10240 bytes
$ kubectl --namespace tarsplosion get events
LAST SEEN   FIRST SEEN   COUNT     NAME                          KIND      SUBOBJECT               TYPE      REASON                 SOURCE                                            MESSAGE
11m         11m          1         demo-xmnr6.1514cb1745ab1ac3   Pod                               Normal    SandboxChanged         kubelet, gke-tier1-central-pool-3-3f529658-8wwx   Pod sandbox changed, it will be killed and re-created.
11m         11m          1         demo-xmnr6.1514cb1e485790c7   Pod       spec.containers{demo}   Normal    Killing                kubelet, gke-tier1-central-pool-3-3f529658-8wwx   Killing container with id docker://demo:Need to kill Pod
11m         11m          1         demo.1514cb1e8c67ab56         Job                               Warning   BackoffLimitExceeded   job-controller                                    Job has reach the specified backoff limit
$ kubectl --namespace tarsplosion get pods -a
NAME         READY     STATUS      RESTARTS   AGE
demo-s7tzf   0/1       OOMKilled   0          17h

These events occur around the same time. I guess some other task is contending for resources, maybe causing buffer growth that pushes it over the limit due to I/O slowdown of the PV? Hard to guess.

@devopsprosiva

I ran into the same issue today. On a node with 32GB of memory, 16+GB is cached. When memory used plus cache exceeded 29GB (~90% of 32GB), the kubelet tried to evict all the pods, which shouldn't have happened since the node still had close to 50% of its memory available, albeit in cache. Is there a fix for this issue?

@berlincount

As part of other investigations we've been recommended to use https://github.com/Feh/nocache to wrap the corresponding calls, which helped a fair amount :)
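
(nocache is an LD_PRELOAD wrapper, so usage is simply prefixing the cache-heavy command; the tar invocation below is only an example:)

# Run a cache-heavy command under nocache so the pages it touches are
# dropped from the page cache as it goes.
nocache tar -C /data -xf /data/backup.tar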

@treacher

Also having this problem. We had 53GB of available memory and 0.5GB free. 52.5GB is in buff/cache and it starts trying to kill pods due to SystemOOM.

@berlincount

If it is indeed expected behaviour it certainly seems to surprise some people in nasty ways ...

@devopsprosiva

This is not expected behavior. The OS caching memory has been around for a long time, and any app looking at memory usage should account for the cached memory. Using nocache is not an ideal solution either. Is there any way we can bump up the severity/need on this issue? We're planning to go into production soon but can't without this issue getting fixed.

@thefirstofthe300

I've been digging in here trying to figure out why the workload @berlincount provided was OOMing. Given the simplicity, it just seemed odd.

Using the exact Job spec provided by Andreas, I spun up a k8s cluster, ran the job, and watched the kernel stats for the pod. The kernel was performing sanely for the most part. When it started to come under memory pressure, it started evicting from the page cache. Eventually, the page cache values approached zero.

The following is the output of the cgroup's memory.stat about thirty seconds before the OOM.

cache 139264
rss 3411968
rss_huge 0
mapped_file 61440
dirty 0
writeback 0
swap 0
pgpgin 115745935
pgpgout 115745068
pgfault 418117
pgmajfault 46033
inactive_anon 0
active_anon 3411968
inactive_file 24576
active_file 114688
unevictable 0
hierarchical_memory_limit 1073741824
hierarchical_memsw_limit 2147483648
total_cache 131072
total_rss 3411968
total_rss_huge 0
total_mapped_file 61440
total_dirty 0
total_writeback 0
total_swap 0
total_pgpgin 115745935
total_pgpgout 115745070
total_pgfault 418117
total_pgmajfault 46033
total_inactive_anon 0
total_active_anon 3411968
total_inactive_file 16384
total_active_file 0
total_unevictable 0

and this

[ 6971.999289] memory: usage 1048576kB, limit 1048576kB, failcnt 38707874
[ 6972.005986] memory+swap: usage 1048576kB, limit 9007199254740988kB, failcnt 0
[ 6972.013451] kmem: usage 1045024kB, limit 9007199254740988kB, failcnt 0
[ 6972.020113] Memory cgroup stats for /kubepods/podf78402b6-33bd-11e8-ba59-42010a8a0160: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 6972.044815] Memory cgroup stats for /kubepods/podf78402b6-33bd-11e8-ba59-42010a8a0160/c38dca77af0d0c47320ee1aeffe4224894ae05d14cb0a54cc0d7fcb5f781fd0f: cache:0KB rss:40KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:40KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 6972.075069] Memory cgroup stats for /kubepods/podf78402b6-33bd-11e8-ba59-42010a8a0160/d7a947146cf94e188208fc1633a2f1275f9e75a2e8c7334133363a7743e81858: cache:116KB rss:3396KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:3396KB inactive_file:0KB active_file:4KB unevictable:0KB

is the cgroup stats from the OOM log. As you can see, kmem goes through the roof. I have no idea why, but the culprit doesn't appear to be the page cache or any of the processes in the container.

From what I can tell, kmem counts against the cgroup's memory limit so it looks like the kernel is hogging the memory.

Looking at the cgroup's slabinfo, I can see a fairly large number of radix_tree_node slabs, though I'm still not seeing where a whole GB of memory went.

Of note, the memory in question appears to be freed after the cgroup is destroyed.

I'm not 100% sure what's going on here but, to me, it looks like there may be a memory leak in the kernel OR I'm missing something (I personally find this second option to be slightly more plausible than the first).

Any ideas?

@thefirstofthe300

Interesting...the nodes I executed the above tests on were GKE COS nodes

$ cat /etc/os-release
BUILD_ID=10323.12.0
NAME="Container-Optimized OS"
KERNEL_COMMIT_ID=2d7de0bde20ae17f934c2a2e44cb24b6a1471dec
GOOGLE_CRASH_ID=Lakitu
VERSION_ID=65
BUG_REPORT_URL=https://crbug.com/new
PRETTY_NAME="Container-Optimized OS from Google"
VERSION=65
GOOGLE_METRICS_PRODUCT_ID=26
HOME_URL="https://cloud.google.com/compute/docs/containers/vm-image/"
ID=cos
$ uname -a
Linux gke-yolo-default-pool-a42e49fb-1b0m 4.4.111+ #1 SMP Thu Feb 1 22:06:37 PST 2018 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux

Swapping the node out for an Ubuntu node seems to correct the memory usage. The pod never uses more than 600 MiB of RAM according to kubectl top pod.

This is looking more and more like some kind of memory leak or misaccounting that's present in the 4.4 series kernel used in COS nodes but not in the 4.13 series used by Ubuntu nodes.

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
$ uname -a 
Linux gke-yolo-pool-1-cb926a0e-51cf 4.13.0-1008-gcp #11-Ubuntu SMP Thu Jan 25 11:08:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

@berlincount

We seem to be having more success running our containers on Ubuntu nodes ... so, I'd concur :)

@bitglue

bitglue commented May 30, 2018

I'd like to share some observations, though I can't say I have a good solution to offer yet, other than to set a memory limit equal to the memory request for any pod that makes use of the file cache.

Perhaps it's just a matter of documenting the consequences of not having a limit set.

Or perhaps an explicit declaration of cache reservation should exist in the podspec, in lieu of assuming "inactive -> not important to reserve".

Another possibility I've not explored is cgroup soft limits, and/or a more heuristic based detection of memory pressure.

Contrary Interpretations of "Inactive"

Kubernetes seems to have an implicit belief that the kernel is finding the working set and keeping it in the active LRU. Everything not in the working set goes in the inactive LRU and is reclaimable.

A quote from the documentation [emphasis added]:

The value for memory.available is derived from the cgroupfs instead of tools like free -m. This is important because free -m does not work in a container, and if users use the node allocatable feature, out of resource decisions are made local to the end user Pod part of the cgroup hierarchy as well as the root node. This script reproduces the same set of steps that the kubelet performs to calculate memory.available. The kubelet excludes inactive_file (i.e. # of bytes of file-backed memory on inactive LRU list) from its calculation as it assumes that memory is reclaimable under pressure.

Compare a comment from mm/workingset.c in the Linux kernel:

All that is known about the active list is that the pages have been accessed more than once in the past. This means that at any given time there is actually a good chance that pages on the active list are no longer in active use.

While both Kubernetes and Linux agree that the working set is in the active list, they disagree about where memory in excess of the working set goes. I'll show that Linux actually wants to minimize the size of the inactive list, putting all extra memory in the active list, as long as there's a process using the file cache enough for it to matter (which may not be the case if the workload on a node consists entirely of stateless web servers, for example).

The Dilemma

One running an IO workload on Kubernetes must:

  1. set a memory limit equal to or less than the memory request for any pod that utilizes the file LRU list, or
  2. accept that any IO workload will eventually exceed its memory request through normal and healthy utilization of the file page cache.

The Kernel Implementation

Note I'm not a kernel expert. These observations are based on my cursory study of the code.

When a page is first loaded, add_to_page_cache_lru() is called. Normally this adds the page to the inactive list, unless this is a "refault". More on that later.

Subsequent accesses to a page call mark_page_accessed() within mm/swap.c. If the page was on the inactive list it's moved to the active list, incrementing pgactivate in /proc/vmstat. Unfortunately this counter does not distinguish between the anonymous and file LRUs, but examining pgactivate in conjunction with nr_inactive_file and nr_active_file gives a clear enough picture. These same counters are available within memory.stat for cgroups as well.

Accessing a page twice is all that's required to get on the active list. If the inactive list is too big, there may not be enough room in the active list to contain the working set. If the inactive list is too small, pages may be pushed off the tail before they've had a chance to move to the active list, even if they are part of the working set.

mm/workingset.c deals with this balance. It forms estimates from the inactive file LRU list stats and maintains "shadow entries" for pages recently evicted from the inactive list. When add_to_page_cache_lru() is adding a page and it sees a shadow entry for that page it calls workingset_refault() and workingset_refault is incremented in /proc/vmstat. If that returns true then the page is promoted directly to the active list and workingset_activate in /proc/vmstat is incremented. It appears this code path does not increment pgactivate.

So accessing a page twice gets it added to the active list. What puts downward pressure on the active list?

During scans (normally by kswapd, but directly by an allocation if there are insufficient free pages), inactive_list_is_low() may return true. If it does, shrink_active_list() is called.

The comments to inactive_list_is_low() are insightful:

 * The inactive anon list should be small enough that the VM never has
 * to do too much work.
 *
 * The inactive file list should be small enough to leave most memory
 * to the established workingset on the scan-resistant active list,
 * but large enough to avoid thrashing the aggregate readahead window.
 *
 * Both inactive lists should also be large enough that each inactive
 * page has a chance to be referenced again before it is reclaimed.
 *
 * If that fails and refaulting is observed, the inactive list grows.
 *
 * The inactive_ratio is the target ratio of ACTIVE to INACTIVE pages
 * on this LRU, maintained by the pageout code. An inactive_ratio
 * of 3 means 3:1 or 25% of the pages are kept on the inactive list.
 *
 * total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB

So, the presence of refaults (meaning, pages are faulted, pushed off the inactive list, then faulted again) indicates the inactive list is too small, which means the active list is too big. If refaults aren't happening then the ratio of active:inactive is capped by a formula based on the total size of inactive + active. A larger cache favors a larger active list in proportion to the inactive list.

I believe (though I've not confirmed with experiment) that the presence of a large number of refaults could also mean there simply isn't enough memory available to contain the working set. The refaults will cause the inactive list to grow and the active list to shrink, causing Kubernetes to think there is less memory pressure, the opposite of reality!
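
(A quick way to watch for this on a node is to sample the counters mentioned above; a minimal sketch reading the system-wide values from /proc/vmstat (cgroup v2's per-cgroup memory.stat exposes similar fields):)

# Print activation and refault counters once per second. Rising
# workingset_refault alongside a shrinking active list is the signal
# described above.
while sleep 1; do
  awk '/^(pgactivate|workingset_refault|workingset_activate)/ {printf "%s=%s ", $1, $2} END {print ""}' /proc/vmstat
done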

@calvix

calvix commented Jun 13, 2018

+1 👍

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

liutingjieni pushed a commit to bytedance/atop that referenced this issue Jul 27, 2023
The kubelet will terminate end-user pods when the worker node has
'MemoryPressure' according to [1]. But confusingly, there exist two
reasons for pods being evicted:
- one is that the whole machine's free memory is too low,
- the other is k8s's own calculation [2], i.e. memory.available [3]
  is too low.

To resolve such confusion for k8s users, collect and show k8s global
workingset memory to distinguish between these two causes.

Note:
1. Collecting only the k8s global memory stats is enough, because
   cgroupfs stats are propagated from child to parent, so the parent
   always notices the change and updates accordingly. And from
   v1.6 k8s [4], allocatable (/sys/fs/cgroup/memory/kubepods/) is more
   convincing than capacity (/sys/fs/cgroup/memory/).
2. There are two cgroup drivers or managers to control resources:
   cgroupfs and systemd[5]. We should take both into account.
   (The 'systemd' cgroup driver always ends with '.slice')
3. The difference between cgroupv1 and cgroupv2: different field names
   in the memory.stat file, and memory.currentUsage stored in different
   files (cgv1's memory.usage_in_bytes vs. cgv2's memory.current).

[1]https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-out-of-memory-behavior
[2]kubernetes/kubernetes#43916
[3]memory.available = memory.allocatable/capacity - memory.workingSet,
   memory.workingSet = memory.currentUsage - memory.inactivefile
[4]kubernetes/kubernetes#42204
   kubernetes/community#348
[5]https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/

Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
Reported-by: Teng Hu <huteng.ht@bytedance.com>
@lance5890

maybe related to google/cadvisor#3286

@eldada

eldada commented Oct 23, 2023

I forget where I read it but I heard that cgroups v2 was supposed to fix this. I think all you need to do is use a linux distro with cgroup v2 + update to kube 1.25 which makes kube cgroup v2 aware.

https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/#:~:text=cgroup%20v2%20provides%20a%20unified,has%20graduated%20to%20general%20availability.

It might be worth testing if this is still an issue in the latest version.

Has anyone been able to confirm this?
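
(To check which cgroup version a node is actually running, this one-liner should be enough:)

# Prints cgroup2fs on a unified (v2) hierarchy, tmpfs on cgroup v1.
stat -fc %T /sys/fs/cgroup/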

@jrcichra

We've noticed that prometheus with thanos sidecars increases active_file in the prometheus container due to multiple processes reading the same file (prometheus and thanos-sidecar). active_file gets as big as the amount of prometheus data you have on disk, as that's what thanos is reading in full and shipping to S3.

All the containers in this scenario have a generous guaranteed memory limit.

With large enough prometheus data retention / metric count and limited enough memory, container_memory_working_set_bytes hovers between 90% and 95% of the container memory limit, but the kernel will start shrinking active_file once it needs more memory, either for other files or because prometheus needs more memory (RSS).

This makes it difficult to alert and know when prometheus instances with thanos sidecars are running out of memory and are close to OOM. I'm sure this applies to other heavy data workloads such as a mysql database with a backup sidecar.

In short, container_memory_working_set_bytes doesn't always accurately describe when a container is nearing an OOM kill from the kernel. In most cases it does, but not always.

It would be great if container_memory_working_set_bytes or a different metric excluded active_file and any other forms of evictable memory so we could accurately alert on close to OOM'ing containers consistently.
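
(In the meantime, a rough approximation of "non-evictable" usage can be read straight from the container's cgroup; cgroup v1 paths and field names are shown here, and the cgroup path is a placeholder:)

# Total usage minus both file-backed LRU lists for one container cgroup.
CG=/sys/fs/cgroup/memory/kubepods/burstable/podXXXX/YYYY   # placeholder path
usage=$(cat "$CG/memory.usage_in_bytes")
inactive_file=$(awk '/^total_inactive_file / {print $2}' "$CG/memory.stat")
active_file=$(awk '/^total_active_file / {print $2}' "$CG/memory.stat")
echo "non-evictable bytes: $((usage - inactive_file - active_file))"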

@jrcichra

I've opened a cAdvisor PR to expose a non-evictable memory metric: google/cadvisor#3445

Right now it excludes both inactive_file and active_file. The goal for the PR is to have a metric which (as accurately as it can) exposes total non-evictable memory. That may mean additional cgroup fields are included/excluded over time.

@zakisaad

zakisaad commented Jan 20, 2024

https://github.com/linchpiner/cgroup-memory-manager I am using this for workaround, for now, it works ok

Another +1 for this running as a Daemonset - immediate relief on memory pressure (cursory glance at container_working_set_bytes is very promising).

It'll do for now 🤷

@Raboo

Raboo commented Jan 27, 2024

https://github.com/linchpiner/cgroup-memory-manager I am using this for workaround, for now, it works ok

Another +1 for this running as a Daemonset - immediate relief on memory pressure (cursory glance at container_working_set_bytes is very promising).

It'll do for now 🤷

This seems to work by clearing the page cache. Some apps can have their performance crippled by clearing the page cache if they are read-intensive. It is a smart workaround, but be aware of what type of app you are running and how clearing the cache continuously might affect performance negatively, especially with network-based storage.

@howiezhao

maybe somebody will find it useful: we experienced problems running spark on kube when pods were killed. We also saw the cache in the pod rising constantly, and at first glance it seemed very connected to the discussion here. However the real problem came from malloc. Using jemalloc solved the issue for us, and while the cache is still rising, the pod is not killed.

Hi @IgorBerman Would you mind sharing more details about this issue? We seem to have encountered a similar problem.

@hterik

hterik commented Feb 20, 2024

I forget where I read it but I heard that cgroups v2 was supposed to fix this. I think all you need to do is use a linux distro with cgroup v2 + update to kube 1.25 which makes kube cgroup v2 aware.
https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/#:~:text=cgroup%20v2%20provides%20a%20unified,has%20graduated%20to%20general%20availability.
It might be worth testing if this is still an issue in the latest version.

Has anyone been able to confirm this?

Using cgroups v2 does not seem to solve the problem. We have an AKS cluster that runs cgroups v2 by default, and we are still experiencing the same problem.

@Bfault

Bfault commented Mar 7, 2024

I forget where I read it but I heard that cgroups v2 was supposed to fix this. I think all you need to do is use a linux distro with cgroup v2 + update to kube 1.25 which makes kube cgroup v2 aware.
https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/#:~:text=cgroup%20v2%20provides%20a%20unified,has%20graduated%20to%20general%20availability.
It might be worth testing if this is still an issue in the latest version.

Has anyone been able to confirm this?

Using cgroups v2 does not seem to solve the problem. We have an AKS cluster, that runs cgroups v2 as default, and are still experiencing same problem.

Same as hterik:
I'm running MongoDB on Kubernetes in an AKS cluster and I'm still getting my pod evicted with Kubernetes v1.27.9, cgroups v2, and Ubuntu 22.04.

@meijie-xiang

I have recently also faced this issue in a cluster when exporting data to cloud storage periodically. The memory manager (https://github.com/linchpiner/cgroup-memory-manager) did not work very well for me, as it destroyed the performance of my original functionality. Since I/O operations consume page cache in Linux by default, another workaround is to use the O_DIRECT flag to read/write data, bypassing the page cache entirely. It consumed quite some CPU and RAM and affected performance to some extent, but we eventually managed to export files with an acceptable compromise in performance by setting a relatively large export batch size via testing. Hope this helps a bit if you also have similar memory issues with frequent I/O operations in a cluster.
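
(For illustration, bypassing the cache for a bulk copy can be as simple as asking dd for direct I/O; the paths and block size below are placeholders, and note that O_DIRECT generally requires block-aligned I/O sizes:)

# Copy with O_DIRECT on both ends so neither the read nor the write
# populates the page cache. Paths and block size are placeholders.
dd if=/data/export/dump.bin of=/mnt/cloud/dump.bin bs=4M iflag=direct oflag=direct status=progress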

@ozzieba

ozzieba commented Apr 2, 2024

Above workaround unfortunately only works with cgroups v1. I believe this is a rough, quick-and-dirty bash equivalent for cgroups v2, only partly tested (I believe it works after running it for a couple of days, but I haven't rigorously tested it):

      - command:
        - /bin/bash
        - -c
        - |
          #!/bin/bash
          total_memory=$(awk '/MemTotal/ {print $2}' /host/proc/meminfo)
          threshold=$((total_memory * 10 / 100))
          if [ $threshold -gt 2097152 ]; then
            threshold=2097152
          fi
          reclaim_amount=$((threshold * 2 ))
          while true; do
            free_memory=$(awk '/MemFree/ {print $2}' /host/proc/meminfo)
            if [ $free_memory -lt $threshold ]; then
              echo "Free memory below 2GB and 10%: $((free_memory / 1024)) MB"
              while [ $free_memory -lt $(( $threshold * 2 )) ]; do
                echo "$threshold"K|tee /host/sys/fs/cgroup/memory.reclaim || true #TODO: research if this can fail for begin too much to reclaim, for now rely on next line running repeatedly
                echo 100M|tee /host/sys/fs/cgroup/memory.reclaim || true
                free_memory=$(awk '/MemFree/ {print $2}' /host/proc/meminfo)
              done
            fi
            sleep 1
          done
        image: google/cloud-sdk:latest
        securityContext:
          privileged: true # TODO: check if this is actually needed
        volumeMounts:
        - mountPath: /host/sys
          name: host-sys
        - mountPath: /host/proc
          name: host-proc

(part of a Daemonset definition that also dynamically provisions disks for swap)

@alvaroaleman
Member

Cgroup v2 doesn't have this issue; I'm not sure what you are expecting the DS to do.

@ozzieba

ozzieba commented Apr 2, 2024

@alvaroaleman Interesting, are you able to confirm that cgroups v2 doesn't have this issue (given the conflicting reports above)? If so, I am hitting a rather similar-looking issue, specifically running with swap, that I am 90% confident is solved by the above script (based on a workload which was reliably triggering "The node was low on resource: memory. Threshold quantity: 100Mi", and now isn't; I can confirm that there's definitely enough memory+swap for the workload at every point).

Also, I can confirm that the script did not work when I was looking at MemAvailable rather than MemFree

polyrabbit added a commit to polyrabbit/juicefs that referenced this issue Apr 18, 2024
jrcichra added a commit to jrcichra/cadvisor that referenced this issue Apr 24, 2024
The goal of this PR is to have additional cAdvisor metrics which
expose total_active_file and total_inactive_file.

Today working_set_bytes subtracts total_inactive_file in its calculation,
but there are situations where exposing these metrics directly is valuable.

For example, two containers sharing files in an emptyDir increases total_active_file over time. This is not tracked in the working_set memory.

Exposing total_active_file and total_inactive_file to the user
allows them to subtract out total_active_file or total_inactive_file
if they so choose in their alerts.

In the case of prometheus with a thanos sidecar, working_set can give
a false sense of high memory usage. The kernel counts thanos reading prometheus written files as "active_file" memory.
In that situation, a user may want to exclude active_file from their ContainerLowOnMemory alert.

Relates to: kubernetes/kubernetes#43916
@emzxcv

emzxcv commented Apr 25, 2024

@alvaroaleman Also keen to know if cgroups v2 solves this problem. What are the steps you took to verify this?
