Kubelet needs to allow configuration of container memory-swap #7294

Closed
derekwaynecarr opened this issue Apr 24, 2015 · 44 comments
Labels
area/isolation; priority/awaiting-more-evidence (Lowest priority. Possibly useful, but not yet enough support to actually get it done.); sig/node (Categorizes an issue or PR as relevant to SIG Node.)

Comments

@derekwaynecarr
Member

I would like the Kubelet to support a flag like the following:

--container-memory-swap-limit-factor=0.1  (defaults to 0, meaning no memory-swap)

This would control the default amount of MemorySwap given to a container, as a percentage of its requested memory limit.

See https://docs.docker.com/reference/run/#runtime-constraints-on-cpu-and-memory for the third row, which describes the behavior we get today when the user sets a memory limit but the kubelet does not specify memory-swap:

(specify memory without memory-swap) The container is not allowed to use more than L bytes of memory, swap plus memory usage is double of that.

We think using memory-swap at all is a big step to take, and if the administrator wants to allow memory swap, it should not default to 2L (where L is the memory specified).

For affected area of Kubelet, see:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/kubelet/dockertools/manager.go#L442

For value we should explicitly set, see:
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/Godeps/_workspace/src/github.com/fsouza/go-dockerclient/container.go#L177
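
A minimal sketch of how such a flag might translate into the Docker MemorySwap value when the kubelet builds a container's host config (the function and flag handling are hypothetical, not existing kubelet code; Docker's MemorySwap field counts memory plus swap):

```go
// Hypothetical sketch of applying a --container-memory-swap-limit-factor
// flag. None of these names exist in the kubelet; this only illustrates
// the proposed semantics.
package main

import "fmt"

// memorySwapForLimit returns the value to pass to Docker's MemorySwap
// field for a container with the given memory limit (bytes). Docker
// interprets MemorySwap as memory + swap, so factor 0 maps to "no swap"
// (MemorySwap == Memory) instead of Docker's default of 2x the limit.
func memorySwapForLimit(memoryLimitBytes int64, factor float64) int64 {
	swap := int64(float64(memoryLimitBytes) * factor)
	return memoryLimitBytes + swap
}

func main() {
	const limit = 512 * 1024 * 1024 // 512 MiB memory limit
	fmt.Println(memorySwapForLimit(limit, 0))   // 536870912: no swap allowed
	fmt.Println(memorySwapForLimit(limit, 0.1)) // 590558003: 10% of the limit as swap
}
```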

@derekwaynecarr
Member Author

/cc @thockin this aligns with our IRC discussion.

@derekwaynecarr
Member Author

/cc @danmcp per our chat as well.

@davidopp davidopp added sig/node Categorizes an issue or PR as relevant to SIG Node. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Apr 24, 2015
@thockin thockin assigned dchen1107 and unassigned thockin Apr 24, 2015
@nkwangleiGIT
Contributor

Hi all,
For this issue, do we want to support memory swap at both the kubelet and the API level?

  1. Add an option --container-memory-swap-limit-factor=0 to the kubelet, and it'll be a global default setting for memory swap.
  2. In the container's resource limits, we could also specify something like container.resources.limits.memoryswap=1; then it'll use 1L as memory swap (L is the memory specified). And it'll map to 2L of Docker memory swap, as that value in Docker is the sum of memory and memory swap.

We're also customizing k8s to support memory swap in our project, so it can leverage Docker memory swap. I'd like to contribute code through this issue; is my understanding above correct?
Let me know if any design changes are needed, thanks!

@erictune
Member

I'd rather not put factors into a ResourceList, as is implied in item 2 of the previous comment. All the items in resources.limits should be additive. If you want to allow users to specify this from the API, then they can just specify an absolute amount of swap; or, if they leave it empty, some admission controller could default it, and they could set it to an explicit value of 0 (or maybe 1) if they don't want any swap.
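
A rough sketch of that admission-time defaulting idea (the `SwapBytes` field, the defaulting function, and the cluster default are all hypothetical; nothing like this exists in the Kubernetes API):

```go
// Hypothetical sketch: if the user leaves the proposed swap limit empty,
// an admission-controller-style step fills in a cluster default; an
// explicit 0 means "no swap".
package main

import "fmt"

type limits struct {
	MemoryBytes int64
	// SwapBytes is the proposed swap-only limit; nil means "not specified".
	SwapBytes *int64
}

func defaultSwap(l *limits, clusterDefaultSwap int64) {
	if l.SwapBytes == nil {
		d := clusterDefaultSwap
		l.SwapBytes = &d // user left it empty: apply the cluster default
	}
	// An explicit value, including 0 ("no swap"), is left untouched.
}

func main() {
	l := limits{MemoryBytes: 256 << 20}
	defaultSwap(&l, 0) // a conservative cluster default of no swap
	fmt.Println(*l.SwapBytes)
}
```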

@nkwangleiGIT
Contributor

Hi erictune,
I agree with you on option 2; we'd better let the user specify an absolute value for memory swap instead of a factor in the API. That makes it consistent with Docker, and of course we should have a default value if the user doesn't set one. How about the question below: should the memory swap here be only swap, or the sum of memory and swap?

And it'll map to 2L of Docker memory swap, as that value in Docker is the sum of memory and memory swap.

What about kubelet-level support, do we still need it?
Thanks.

@erictune
Member

The default value should be to not use swap at all.

I think docker chose their format to be just like the cgroups interface. But, I don't think we need to copy that pattern. I'd rather that the resources list does not double-count resources or mix resources. It should be easy to add up resource lists like you add up vectors, to get aggregate amounts of resources.

So, I think swap should just be the amount of swap, not swap + memory. And it should be called "swap" not "memoryswap", to reflect that.

One case to think about: swap files can be on rotating disk or on SSD, which gives very different performance characteristics. Also, the scheduler needs to subtract any swap space from the storage space of the node when a pod is bound to it, and it needs to know whether to subtract that from rotating disk or SSD. So, should we have separate "diskswap" and "ssdswap" resources?
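
A rough illustration of the additivity point above, with plain Go maps standing in for the real ResourceList type (the resource names and quantities are only illustrative): if swap is reported as swap-only, per-container lists sum like vectors; if the value were memory+swap, the sum would double-count memory.

```go
// Rough illustration of why resource lists should be additive and not
// double-count resources. Plain maps stand in for the real ResourceList
// type; the "swap" resource name is only the proposal discussed above.
package main

import "fmt"

type resourceList map[string]int64 // resource name -> bytes

func add(a, b resourceList) resourceList {
	sum := resourceList{}
	for k, v := range a {
		sum[k] += v
	}
	for k, v := range b {
		sum[k] += v
	}
	return sum
}

func main() {
	// Two containers, each requesting memory and swap separately.
	c1 := resourceList{"memory": 256 << 20, "swap": 64 << 20}
	c2 := resourceList{"memory": 512 << 20, "swap": 0}

	// Summing the lists gives the pod's total demand directly: memory
	// 768Mi, swap 64Mi. Had "swap" meant memory+swap (the Docker
	// convention), the sum would double-count memory and be meaningless.
	fmt.Println(add(c1, c2))
}
```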

@nkwangleiGIT
Contributor

Hi erictune,
that's a good point, but I'm not sure why we need separate resources for "diskswap" and "ssdswap". The user can choose normal-disk or SSD swap as they need from a performance perspective; we only need to subtract the size of the swap when calculating the available space on a node.
Do you mean the swap size may be different depending on whether it's a normal disk or an SSD?
Thanks for help!

@erictune
Member

I was thinking of the case where swap files are added dynamically by the kubelet in response to pods being started. In this model, the scheduler needs to subtract the swap from the resources of the machine, and it needs to know which resources to reduce.

However, if nodes are statically provisioned with swap, and they exclude the provisioned swap resources from the reported capacity, then I agree we can do it like you say.


@nkwangleiGIT
Contributor

Hi erictune,
OK, I see. Then we may also want to use 'memory swap' as a factor when the scheduler calculates which nodes fit a pod, and we also need some way to know the total swap size on each node.
Do we already have swap size or capacity tracked in the current scheduler and kubelet implementations? It'll probably need more changes if we want to add all of that, thanks!

@thockin
Member

thockin commented Jun 29, 2015

I don't think we should allow pods to add swap to the system. I'd rather see something simple like a "max swap" which defaults to 0. If the system has swap enabled, the pod can use up to that much swap. If the system does not have swap, too bad. This is bad from an isolation (IOPS, IO bandwidth) point of view, but we have nothing there now, so we can maybe revisit later?


@nkwangleiGIT
Contributor

Hi thockin,
OK, we can revisit this after v1.
BTW, for now we have enabled memory swap for pods in our k8s system, just following what Docker expects. Thanks!

@dchen1107
Member

I don't want to support memory-swap by default, at least for the near future:

  1. It won't work well with our QoS proposal: https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/proposals/resource-qos.md

  2. Workloads asking for Guaranteed resources will not use it. They need repeatable memory latency, and they want predictable performance too.

  3. Adding swap slows down jobs and introduces more disk bandwidth and isolation issues. We don't manage disk IO yet, and it is hard to manage. Without better disk IO management, simply enabling swap for a container/pod is a bad solution.

On the other hand, I understand some users might want to use swap no matter what, or some applications run on a dedicated node without any sharing and still require swap space. @brendandburns proposed a way to simply pass such requests through to the Docker engine for certain privileged containers/pods. cc/ @bgrant0607

@bgrant0607
Member

What is the use case for swap? If swap is enabled for any containers on a host, then no latency guarantees can be provided for other containers.

@dchen1107
Member

Unassigning myself since we don't want to add swap support for now, unless someone provides me with a really appealing use case.

@dchen1107 dchen1107 removed their assignment Nov 2, 2015
@streamnsight

what's the proper way to handle this then? disable swap on the host?

@erictune
Member

Disabling swap is a good approach. When you have multiple containers and multiple machines they could be scheduled on, it is better to kill one container than to have all the containers on a machine operate at an unpredictable, probably slow, rate.

@mikedanese
Member

cc @vishh

@vishh
Contributor

vishh commented May 19, 2016

Based on the existing comments, this issue can be closed since there is no intention to support swap in the kubelet in the near future. @derekwaynecarr Feel free to re-open if you'd like to continue discussing support for swap.

@vishh vishh closed this as completed May 19, 2016
@derekwaynecarr
Member Author

Yeah. This can be closed. I actually don't want to support swap right now either; when this was first opened, swap was being set incorrectly. But all is good now.


@yoshuawuyts

To contribute a use case for future reference: Swap space for running npm install on smaller nodes is essential. Small being about <1GB RAM.

When npm install is run, the dependency graph can become a lot larger than the amount of memory available, causing npm install to reliably crash. Being able to allocate swap for this step would be tremendously helpful, if anything just to build a new container with all dependencies compiled that can then be redistributed.

Hope this is somewhat helpful. Thanks!


@derekwaynecarr
Member Author

I would rather support swap as a cheap means of supporting overcommit, but there are a lot of intermediate steps to get from a to b. Would prefer we finish up existing QOS capability in 1.5 before supporting swap as an advanced usage scenario at that point.

On Monday, August 15, 2016, Yoshua Wuyts wrote:

> IMHO, supporting swap is not worth the cost of making micro node clusters reliable

> Yeah agree, there's probably more pressing things to work on right now. All I'm hoping to achieve is that when future discussions about adding swap / disk management occur, you'll take the use case of tiny nodes into account C:

@derekwaynecarr
Member Author

And to be clear this original issue was because Kubelet passed the wrong value (allowing swap) and I wanted to reduce or eliminate it. This was far before QOS was fully defined.


@ravilr
Contributor

ravilr commented Oct 7, 2016

@derekwaynecarr @vishh sorry to comment on a closed issue, but this has some context here:

Kubelet seems to set MemorySwap (libcontainer config) to -1 by default here:
https://github.com/kubernetes/kubernetes/blob/v1.4.0/pkg/kubelet/dockertools/docker_manager.go#L662
https://github.com/kubernetes/kubernetes/blob/v1.4.0/vendor/github.com/docker/engine-api/types/container/host_config.go#L253

which means /sys/fs/cgroup/memory.memsw.limit_in_bytes is set to unlimited (mem + swap) for the container.

Shouldn't Memory+swap also be set to container.Resources.Limits.Memory().Value() above? Or am I missing something?

@vishh
Contributor

vishh commented Oct 7, 2016

Our approach has been to ask users not to enable swap, but we never really disallowed the usage of swap. Given that it interferes with our ability to provide QoS features, we are starting to disallow swap. We can either disallow swap completely and/or have each container configured to not use swap.


@ravilr
Contributor

ravilr commented Oct 10, 2016

@derekwaynecarr @vishh Setting memsw.limit to unlimited by default doesn't seem right. Can we fix that? Not every environment has swap disabled by default.

@hongxima

Dear all, there is a new use case where we really need either:
a) Disable swap by default (set MemorySwap=0), or
b) Provide an option in the kubelet to do so.

It's about Windows Docker compatibility. It was all good when using Windows Docker version 1.12.2-cs2-ws-beta, but after upgrading to 1.14.0-dev, Windows Docker doesn't start any containers, with the error output below:
"invalid option: Windows does not support MemorySwap"
Meanwhile, MemorySwap was set to "-1" for "unlimited", but the code expects it to be zero:

https://github.com/docker/docker/blob/master/daemon/daemon_windows.go#L175

Disabling MemorySwap for every container is an alternative, but that would almost kill me from a long-term point of view.

@vishh
Contributor

vishh commented Dec 19, 2016 via email

@ozbillwang

ozbillwang commented Dec 19, 2016

@vishh

Is this statement for Linux as well?

Do you have any documentation for swap management in Kubernetes? Or any roadmap for supporting it in the future?

@vishh
Contributor

vishh commented Dec 20, 2016 via email

@derekwaynecarr
Member Author

derekwaynecarr commented Dec 20, 2016 via email

@hongxima

I've created a separate issue #39003 specifically pointing to the showstopper for the Windows case. @brendanburns provided a PR #39005 to disable MemorySwap only when the OS is Windows, which keeps the current behavior for Linux ("unlimited memory swap") and also makes Windows Docker work; it seems to be a proper solution, at least for my case.

@munkyboy

FWIW, the default behavior is to disable swap since 1.6.0

@jonmoter

jonmoter commented Jan 7, 2018

Sorry to comment on an old dead thread, but I was looking at the commit that disabled swap, and comparing that to the docker docs on swap.

In particular, the Docker docs say:

  • If --memory-swap is set to 0, the setting is ignored, and the value is treated as unset.
  • If --memory-swap is set to the same value as --memory, and --memory is set to a positive integer, the container will not have access to swap. See Prevent a container from using swap.
  • If --memory-swap is unset, and --memory is set, the container can use twice as much swap as the --memory setting, if the host container has swap memory configured. For instance, if --memory="300m" and --memory-swap is not set, the container can use 300m of memory and 600m of swap.

So the way I'm reading the docs, if kubelet sets memory swap to 0, then that will allow the container to have 2x its memory limit in swap. To disable swap, --memory-swap should have the same value as the memory limit we set for the container.

Am I missing something here? Or is this a bug?
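
For reference, here are the three quoted cases expressed as the Memory/MemorySwap pairs the kubelet would hand to Docker (a sketch using a local stand-in struct and illustrative numbers, not actual kubelet code):

```go
// Sketch of the --memory / --memory-swap cases quoted above. The struct
// mirrors just the two fields this thread is about; the real Docker
// HostConfig has many more.
package main

import "fmt"

type hostConfig struct {
	Memory     int64 // --memory, in bytes
	MemorySwap int64 // --memory-swap, in bytes; Docker treats this as memory + swap
}

func main() {
	const mem = 300 << 20 // 300 MiB, matching the example in the quoted docs

	cases := []struct {
		name string
		hc   hostConfig
	}{
		// 0 is treated by Docker as unset, which falls back to twice the
		// memory limit, so the container can still use swap (the k8s 1.6
		// behavior observed in the next comment).
		{"MemorySwap = 0 (treated as unset)", hostConfig{Memory: mem, MemorySwap: 0}},
		// Setting MemorySwap equal to Memory is what actually disables swap.
		{"MemorySwap = Memory (swap disabled)", hostConfig{Memory: mem, MemorySwap: mem}},
		// -1 means unlimited memory+swap (the pre-1.6 kubelet default).
		{"MemorySwap = -1 (unlimited)", hostConfig{Memory: mem, MemorySwap: -1}},
	}
	for _, c := range cases {
		fmt.Printf("%-40s Memory=%d MemorySwap=%d\n", c.name, c.hc.Memory, c.hc.MemorySwap)
	}
}
```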

@jonmoter

jonmoter commented Jan 8, 2018

FYI, I'm running a 1.6.11 Kubernetes cluster, and confirmed that the pods running in that cluster are being allocated swap, at 2x the amount of memory that their limit is set to. (The below is a container running kube-state-metrics.)

[root@node ~]# docker inspect b77bf3c86acc | grep Memory
            "Memory": 150000000,
            "KernelMemory": 0,
            "MemoryReservation": 0,
            "MemorySwap": 300000000,
            "MemorySwappiness": -1,

@embano1
Member

embano1 commented Jan 22, 2018

@jonmoter Recent kubelet versions have a flag to work around this effect, by checking whether swap is turned on:

--fail-swap-on (default: true)
https://kubernetes.io/docs/reference/generated/kubelet/

With that flag set, the kubelet refuses to start if swap is enabled on the node. In your version, this setting was not yet implemented, IMHO.

@Dominik-K

@jonmoter Even if you're right that Kubernetes sets MemorySwap = 2 * Memory, the container will be out-of-memory (OOM) killed when its Memory limit is allocated. You can use my small test image: dominikk/swap-test.

@jonmoter

jonmoter commented Feb 2, 2018

@Dominik-K, the container will be OOM-killed when Memory + Swap is exhausted. So if I have a container where I set the memory limit to 100MB, it will be killed when it exceeds ~300MB of total memory consumption.

I can see this happening in the containers running in my system. And in the stackexchange answer you link in your image, it specifically says (emphasis mine):

The OOM killer should kill this once you are out of RAM and SWAP to give to the program.

My point in bringing all this up was that @munkyboy claimed above that swap was disabled by default in 1.6.0. That's not true. Starting in 1.6, swap is set to be Memory * 2 for each container.

And yes, in recent versions, there's the --fail-swap-on flag. That makes the point moot if you have swap disabled on your host.

k8s-github-robot pushed a commit that referenced this issue Feb 27, 2018
Automatic merge from submit-queue (batch tested with PRs 50724, 59025, 59710, 59404, 59958). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

dockertools: disable MemorySwap on Linux

In this commit, set `MemorySwap` to the same value as `Memory` to prevent the use of swap on Linux.

**What this PR does / why we need it**:

In #39731, @pires tried to disable swap on Linux by setting `MemorySwap` to 0.
However, according to [Docker's docs](https://docs.docker.com/config/containers/resource_constraints/#--memory-swap-details), setting `MemorySwap` to 0 is treated as unset, and its [default behavior](https://github.com/moby/moby/blob/v17.05.0-ce/daemon/daemon_unix.go#L266-L269) is to set it to twice the size of `Memory`, which can still allow the container to use swap.

**Which issue(s) this PR fixes** :

This issue was mentioned in this comment: #7294 (comment)

**Special notes for your reviewer**:

1. For the case on Windows, we can still use the 0 because [Windows does not support `MemorySwap`](https://github.com/moby/moby/blob/v17.05.0-ce/daemon/daemon_windows.go#L185-L187).
2. There is another place that uses `DefaultMemorySwap()`, for the [sandbox](https://github.com/kubernetes/kubernetes/blob/v1.9.2/pkg/kubelet/dockershim/docker_sandbox.go#L505). Maybe setting the sandbox's `MemorySwap` to 0 is fine. I didn't change that.

**Release note**:

```release-note
dockertools: disable memory swap on Linux.
```
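
A simplified sketch of the behavior the PR above describes (the function name and the GOOS switch are illustrative, not the actual dockertools change):

```go
// Simplified sketch of the fix described in the merged PR: on Linux,
// MemorySwap is set equal to Memory so the container cannot use swap;
// on Windows, it stays 0 because Windows rejects non-zero MemorySwap.
// Names here are illustrative, not the real kubelet code.
package main

import (
	"fmt"
	"runtime"
)

// memorySwapFor returns the MemorySwap value to hand to Docker for a
// container with the given memory limit in bytes.
func memorySwapFor(memoryLimit int64) int64 {
	if runtime.GOOS == "windows" {
		// Windows does not support MemorySwap, so keep it at 0.
		return 0
	}
	// On Linux, MemorySwap == Memory prevents the container from using
	// swap; 0 would be treated as unset and allow swap up to 2x the limit.
	return memoryLimit
}

func main() {
	fmt.Println(memorySwapFor(150000000))
}
```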
@huggsboson
Contributor

@jonmoter @Dominik-K This exact thing happened to us: we noticed that one of our Java services was running super slow, and we debugged it down to container.resources.limits.memory = 1Gi with the JVM set to Xms = 3Gi, Xmx = 3Gi. So it wasn't getting OOM-killed, but 2Gi of its address space was swapped.

@vishh
Contributor

vishh commented Jul 24, 2018

@huggsboson what version of k8s were you using? Is kubelet flag --fail-swap-on set to false?

@huggsboson
Contributor

huggsboson commented Jul 24, 2018

We're still back on 1.6. We plan on disabling swap going forward. Just wanted to document some of the bizarre and hard-to-debug behavior you can see with older versions of kube with swap enabled.
