
Implement IPVS-based in-cluster service load balancing #44063

Closed
ghost opened this issue Apr 4, 2017 · 50 comments
Labels
area/ipvs area/kube-proxy kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@ghost

ghost commented Apr 4, 2017

At KubeCon Europe in Berlin last week I presented some work we've done at Huawei scaling Kubernetes in-cluster load balancing to 50,000+ services and beyond, the challenges associated with doing this using the current iptables approach, and what we've achieved using an alternative IPVS-based approach. iptables is designed for firewalling, and based on in-kernel rule lists, while IPVS is designed for load balancing and based on in-kernel hash tables. IPVS also supports more sophisticated load balancing algorithms than iptables (least load, least conns, locality, weighted) as well as other useful features (e.g. health checking, retries etc).

After the presentation, there was strong support (a.k.a. a riot :-) ) for us to open source this work, which we are happy to do. We can use this issue to track that.

For those who were not able to be there, here is the video:

https://youtu.be/c7d_kD2eH4w

And the slides:

https://docs.google.com/presentation/d/1BaIAywY2qqeHtyGZtlyAp89JIZs59MZLKcFLxKE6LyM/edit?usp=sharing

We will follow up on this with a more formal design proposal and a set of PRs, but in summary we added about 680 lines of code to the existing 12,000 lines of kube-proxy (~5%), and added a third mode flag to its command line (mode=IPVS, alongside the existing mode=userspace and mode=iptables).
The performance improvement for load balancer updates is dramatic (update latency reduced from hours per rule to 2 ms per rule). Network latency and variability are also reduced dramatically for large numbers of services.
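(For readers less familiar with IPVS, the structural difference described above can be pictured with a small, purely illustrative Go sketch. This is not kube-proxy code; all type and field names are made up, and it only contrasts a flat iptables-style rule list with IPVS's per-service table and pluggable scheduler.)

```go
package main

import "fmt"

// virtualService identifies a service the way IPVS does: virtual IP, port
// and protocol. The kernel keeps these entries in a hash table, so a
// lookup or update touches one entry instead of scanning a rule list.
type virtualService struct {
	VIP      string
	Port     int
	Protocol string
}

// backend is one real server (pod endpoint) behind a virtual service.
type backend struct {
	IP     string
	Port   int
	Weight int
}

// serviceEntry models one IPVS virtual service: a per-service scheduler
// such as "rr" (round-robin) or "lc" (least-connection), plus its backends.
type serviceEntry struct {
	Scheduler string
	Backends  []backend
}

func main() {
	// The IPVS view: a table keyed by virtual service.
	table := map[virtualService]serviceEntry{}

	// Adding a service (or an endpoint) is a single entry update; the other
	// entries are untouched. An iptables-style flat rule list instead grows
	// linearly and is traversed rule by rule.
	web := virtualService{VIP: "10.96.0.10", Port: 80, Protocol: "tcp"}
	table[web] = serviceEntry{
		Scheduler: "lc",
		Backends: []backend{
			{IP: "10.244.1.5", Port: 8080, Weight: 1},
			{IP: "10.244.2.7", Port: 8080, Weight: 1},
		},
	}

	fmt.Printf("%d virtual service(s); %s:%d has %d backend(s)\n",
		len(table), web.VIP, web.Port, len(table[web].Backends))
}
```

The point is that updating one service touches a single table entry rather than rewriting a long rule list, which is where the per-rule update latency numbers above come from.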

@kubernetes/sig-network-feature-requests
@kubernetes/sig-scalability-feature-requests
@thockin
@wojtek-t

@ghost ghost added area/kube-proxy kind/feature Categorizes issue or PR as related to a new feature. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. team/huawei labels Apr 4, 2017
@gyliu513
Contributor

gyliu513 commented Apr 5, 2017

Even though Kubernetes 1.6 supports 5,000 nodes, kube-proxy with iptables is actually a bottleneck to scaling the cluster to 5,000 nodes. One example: with NodePort services in a 5,000-node cluster, if I have 2,000 services and each service has 10 pods, this will cause 20,000 iptables records on each worker node, which can keep the kernel pretty busy. Using IPVS-based in-cluster service load balancing can help a lot in such cases.

@ravilr
Contributor

ravilr commented Apr 5, 2017

@quinton-hoole is your implementation using IPVS in NAT mode or direct routing mode?

@haibinxie

@ravilr it's NAT mode.

@timothysc
Member

Great work, the iptables issues have been a problem for a while.

re: flow-based scheduling, happy to help get the Firmament scheduler in place. ;-)
/cc @kubernetes/sig-scheduling-feature-requests

@resouer
Contributor

resouer commented Apr 6, 2017

@timothysc I paid attention to Firmament for a while, but didn't quite get the value it adds to Kubernetes. Would you mind explaining what problems flow-based scheduling can solve in the current Kubernetes scheduler?

@timothysc
Member

@resouer speed at scale and rescheduling.

From @quinton-hoole's talk, linked above, it looks like Huawei has been prototyping this.

@ghost
Author

ghost commented Apr 6, 2017

@resouer @timothysc Yes, I can confirm that we're working on a Firmament scheduler, and will upstream it as soon as it's in good enough shape. We might have an initial implementation in the next few weeks.

@deepak-vij

Hi folks, we are currently working on implementing the Firmament scheduler as part of the Kubernetes scheduling environment. We will create a separate issue to track progress and provide updates. Thanks.

@sureshvis

@quinton-hoole thanks for sharing. Waiting to see the design proposal.

In terms of health checking, is every worker doing health checks across all pods to keep the table up to date? How are you planning to handle this at scale?

@ghost
Author

ghost commented Apr 6, 2017

cc @haibinxie

@ghost
Author

ghost commented Apr 6, 2017

To be clear, @haibinxie did all the hard work here. Please direct questions to him.

@MikeSpreitzer
Member

IPVS only deals with IP, not transport protocols, right? A k8s service can include a port transformation: a Service object distinguishes between port and targetPort in the items of its spec.ports list, and an Endpoints object also has a ports.port field in the items of its subsets list. Can your implementation handle this generality, and if not, what happens when the user asks for it?
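(For concreteness, here is what the port/targetPort distinction in question looks like when a Service is built with the Kubernetes Go types; the names and values are illustrative.)

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "web"},
		Spec: corev1.ServiceSpec{
			ClusterIP: "10.96.0.20", // virtual IP the proxy answers on
			Ports: []corev1.ServicePort{{
				Name:       "http",
				Protocol:   corev1.ProtocolTCP,
				Port:       80,                   // port exposed on the service VIP
				TargetPort: intstr.FromInt(8080), // port the backing pods listen on
			}},
		},
	}
	fmt.Printf("%s: VIP port %d -> pod port %s\n",
		svc.Name, svc.Spec.Ports[0].Port, svc.Spec.Ports[0].TargetPort.String())
}
```

The Port-to-TargetPort rewrite here is the "port transformation" being asked about.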

@haibinxie

@MikeSpreitzer Port transformation is well supported.

@thockin
Member

thockin commented Apr 10, 2017 via email

@thockin
Member

thockin commented Apr 10, 2017

Port mapping is handled in NAT mode (called masquerade by IPVS, sadly). As an optimization, a future followup could enable direct-return mode for environments that support it for services that do not do remapping. We'd have to add service IPs as local addresses in pods, which we may want to do anyway.
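(For reference, the forwarding methods contrasted here correspond to per-destination connection flags in the kernel's IPVS UAPI; the small sketch below just mirrors the values from include/uapi/linux/ip_vs.h for illustration.)

```go
package main

import "fmt"

// Per-destination forwarding methods, mirroring the connection-flag values
// in the kernel's include/uapi/linux/ip_vs.h (lower three bits).
const (
	ipvsConnFwdMask        = 0x0007 // IP_VS_CONN_F_FWD_MASK
	ipvsConnFwdMasquerade  = 0x0000 // IP_VS_CONN_F_MASQ: NAT; supports port remapping
	ipvsConnFwdTunnel      = 0x0002 // IP_VS_CONN_F_TUNNEL: IPIP encapsulation
	ipvsConnFwdDirectRoute = 0x0003 // IP_VS_CONN_F_DROUTE: direct return; no remapping
)

func main() {
	flags := uint32(ipvsConnFwdMasquerade)
	fmt.Println("port remapping possible:", flags&ipvsConnFwdMask == ipvsConnFwdMasquerade)
}
```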

@thockin
Member

thockin commented Apr 10, 2017

Last comment for the record here, though I have said it elsewhere.

I am very much in favor of an IPVS implementation. We have somewhat more than JUST load-balancing in our iptables (session affinity, firewalls, hairpin-masquerade tricks), but I believe those can all be overcome.

We also have been asked, several times, to add support for port ranges to Services, up to and including a whole IP. The obvious way to add this would also support remapping, though it is not at all clear how NodePorts would work. IPVS, as far as I know, has no facility for exposing ranges of ports.

@ChenLingPeng
Contributor

In IPVS mode, we have to add all the service addresses to a host device like lo or ethX, am I right?
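(For context: rather than lo or ethX, kube-proxy's IPVS mode ultimately binds service VIPs to a dedicated dummy interface, kube-ipvs0. Below is a minimal sketch of that idea using the github.com/vishvananda/netlink library; the device name and address are illustrative and error handling is kept minimal.)

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Create (or reuse) a dummy device to hold the service virtual IPs,
	// so they count as local addresses without touching lo or ethX.
	dummy := &netlink.Dummy{LinkAttrs: netlink.LinkAttrs{Name: "kube-ipvs0"}}
	if err := netlink.LinkAdd(dummy); err != nil {
		log.Printf("link add (may already exist): %v", err)
	}

	link, err := netlink.LinkByName("kube-ipvs0")
	if err != nil {
		log.Fatal(err)
	}

	// Bind one service cluster IP to the dummy device as a /32.
	addr, err := netlink.ParseAddr("10.96.0.10/32")
	if err != nil {
		log.Fatal(err)
	}
	if err := netlink.AddrAdd(link, addr); err != nil {
		log.Printf("addr add (may already exist): %v", err)
	}
}
```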

@haibinxie

Hi all, I put together a proposal for the alpha version of the IPVS implementation, hoping to get it into Kubernetes 1.7. I need your feedback.

https://docs.google.com/document/d/1YEBWR4EWeCEWwxufXzRM0e82l_lYYzIXQiSayGaVQ8M/edit?usp=sharing

@kubernetes/sig-network-feature-requests
@kubernetes/sig-scalability-feature-requests
@thockin
@wojtek-t

@thockin thockin added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed sig/network Categorizes an issue or PR as relevant to SIG Network. labels May 16, 2017
@dhilipkumars

FYI
If Docker accepts this PR, we may be able to lose the seesaw (libnl.so) dependency. moby/libnetwork#1770

@gyliu513
Contributor

Can kube-router help with this?

@thockin
Member

thockin commented May 27, 2017 via email

@ghost
Author

ghost commented May 27, 2017 via email

@thockin
Member

thockin commented May 27, 2017 via email

@dujun1990

dujun1990 commented May 28, 2017

@thockin @quinton-hoole

Initial PR #46580 is already sent out. PTAL.

@dhilipkumars

@thockin originally we were relying on the seesaw library and had planned to replace it with a pure Go implementation as phase 2 (probably in 1.8). Because of the complexities introduced by the libnl.so dependency, last week we decided to move away from seesaw. Docker's libnetwork had a good set of IPVS APIs but was missing the GetXXX() methods. We quickly contributed those to libnetwork, and they were merged 3 days ago. Now we have vendored libnetwork. PTAL.
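(To give a feel for those bindings, here is a minimal sketch using the IPVS netlink package as it exists today at github.com/moby/ipvs, which grew out of the libnetwork code referenced above. It needs root and the ip_vs kernel modules, and the exact field names may have drifted since this thread, so treat it as illustrative rather than as the kube-proxy code.)

```go
package main

import (
	"log"
	"net"
	"syscall"

	"github.com/moby/ipvs" // successor to the vendored libnetwork/ipvs package
)

func main() {
	// A netlink handle to the kernel IPVS subsystem (no libnl.so needed).
	handle, err := ipvs.New("")
	if err != nil {
		log.Fatal(err)
	}

	// Virtual service: the cluster IP/port, with a scheduler such as
	// round-robin ("rr") or least-connection ("lc").
	svc := &ipvs.Service{
		Address:       net.ParseIP("10.96.0.10"),
		Protocol:      syscall.IPPROTO_TCP,
		Port:          80,
		SchedName:     "rr",
		AddressFamily: syscall.AF_INET,
		Netmask:       0xffffffff,
	}
	if err := handle.NewService(svc); err != nil {
		log.Fatal(err)
	}

	// One real server (pod endpoint) behind it; the zero connection flags
	// mean NAT/masquerade forwarding, so port remapping works.
	dst := &ipvs.Destination{
		Address:       net.ParseIP("10.244.1.5"),
		Port:          8080,
		Weight:        1,
		AddressFamily: syscall.AF_INET,
	}
	if err := handle.NewDestination(svc, dst); err != nil {
		log.Fatal(err)
	}

	// The GetXXX() calls contributed upstream let the proxy read back state.
	services, err := handle.GetServices()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("%d IPVS virtual services programmed", len(services))
}
```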

k8s-github-robot pushed a commit that referenced this issue Aug 30, 2017
Automatic merge from submit-queue (batch tested with PRs 51377, 46580, 50998, 51466, 49749)

Implement IPVS-based in-cluster service load balancing

**What this PR does / why we need it**:

Implement IPVS-based in-cluster service load balancing. It provides a performance enhancement and other benefits to kube-proxy compared with the iptables and userspace modes. Besides, it also supports more sophisticated load balancing algorithms than iptables (least connections, weighted, hash, and so on).

**Which issue this PR fixes**

#17470 #44063

**Special notes for your reviewer**:


* Since the PR is a bit large, I split it and moved the commits related to the IPVS util package to PR #48994. Hopefully that makes it easier to review.

@thockin @quinton-hoole @kevin-wangzefeng @deepak-vij @haibinxie @dhilipkumars @fisherxu 

**Release note**:

```release-note
Implement IPVS-based in-cluster service load balancing
```
@m1093782566
Contributor

/area ipvs

@m1093782566
Contributor

IPVS-based kube-proxy is in beta phase now.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 7, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 15, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@chrishiestand
Contributor

With this issue closed as stale, is there a better issue for following the progress of adding IPVS scheduling algorithms to individual Kubernetes services? I couldn't find another issue that explicitly covers this part of the IPVS roadmap.

@arzarif

arzarif commented Feb 16, 2019

@chrishiestand

It's mentioned here as a future possibility, but like you, I haven't been able to find an issue where this is being actively discussed/worked on. Does anyone know whether there's ongoing work to make service-specific load-balancing algorithms possible?
