Bind GPUs and workers #2511


vigneshwaran-nv-10329

Distribute workers among multiple GPUs.

tilgovi (Collaborator) commented Feb 16, 2021

You should implement this kind of custom logic with Server Hooks. You can probably use pre_fork or post_fork and child_exit for what you want.
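
For context, these hooks are ordinary functions defined in the Gunicorn config file, and the key detail for this use case is which process each one runs in. A minimal skeleton (the hook bodies are placeholders, not a working implementation):

# gunicorn.conf.py (skeleton only)

def pre_fork(server, worker):
    # Runs in the arbiter (master), just before a worker process is forked.
    pass

def post_fork(server, worker):
    # Runs in the newly forked worker process (before the app is loaded,
    # unless preload_app is enabled).
    pass

def child_exit(server, worker):
    # Runs in the arbiter, just after a worker process has exited.
    pass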

tilgovi closed this Feb 16, 2021
tilgovi (Collaborator) commented Feb 16, 2021

Thank you for taking the time to make a PR, but I don't think this is a generic feature that Gunicorn needs to implement.

benoitc (Owner) commented Feb 16, 2021

@tilgovi I think that should be discussed. I'm curious to hear about the current usage of it. @vignesh34v, can you elaborate?

benoitc reopened this Feb 16, 2021
benoitc (Owner) commented Feb 16, 2021

I think supporting GPUs may be useful, but it would probably be better not to mix the types of processes in any case.

tilgovi (Collaborator) commented Feb 16, 2021

Am I correct that this is just giving each worker an environment variable telling it which GPU to use, and ensuring that this stays balanced as workers start and stop?

benoitc (Owner) commented Feb 16, 2021 via email

Oh, I misread the code then, so yes it can probably be closed. Though I'm still interested in the use case it is trying to solve, if it's a generic pattern.

vigneshwaran-nv-10329 (Author)

Am I correct that this is just giving each worker an environment variable telling it which GPU to use, and ensuring that this stays balanced as workers start and stop?

Yes, it is.

vigneshwaran-nv-10329 (Author)


Use case: Deploying a model across multiple GPUs with multiple workers.

mathematicalmichael

+1 for the use case. Minimal image/video processing apps that can distribute workloads across multiple GPUs are niche for now, but potentially very useful.

tilgovi (Collaborator) commented Aug 20, 2021

As I explained above, it should be possible to use server hooks to implement something like this. Perhaps someone can post an example.

If it's possible to articulate a more general description of the problem, maybe we could implement it in core.

This implementation uses some environment variables that specifically refer to GPUs, but the feature is actually just a balancing feature. It could easily be implemented just by giving each worker an ordinal number.

Does someone want to make an attempt? I'm thinking a minheap / priority queue of numbers [0, num workers). That's maybe not sufficient because I don't think our current strategy for replacing workers when the worker count changes would prioritize killing the highest numbered workers. There's probably other nuance I'm missing.

Anyway, someone should try to implement this with server hooks so that people who want this can have it now without it being in core. And if someone wants to articulate the problem statement and algorithm more generally, that would be great.
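
A rough, untested sketch of that ordinal-number idea, keeping the heap in arbiter memory via the hooks above (it ignores the worker-count-change caveat, and the WORKER_ORDINAL variable name is just an example, not an existing convention):

# gunicorn.conf.py (untested sketch of the ordinal-number idea)
import heapq
import os

workers = 4  # the regular Gunicorn worker-count setting; reused for the heap size

# Min-heap of free ordinals in [0, workers). Lives in the arbiter's memory.
_free_ordinals = list(range(workers))
heapq.heapify(_free_ordinals)

def pre_fork(server, worker):
    # Arbiter: hand the lowest free ordinal to the worker about to be forked.
    worker.ordinal = heapq.heappop(_free_ordinals) if _free_ordinals else None

def post_fork(server, worker):
    # Worker: expose the ordinal to the application, e.g. via the environment.
    if worker.ordinal is not None:
        os.environ["WORKER_ORDINAL"] = str(worker.ordinal)

def child_exit(server, worker):
    # Arbiter: the ordinal becomes free again once the worker has exited.
    ordinal = getattr(worker, "ordinal", None)
    if ordinal is not None:
        heapq.heappush(_free_ordinals, ordinal)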


benoitc closed this May 7, 2023
mathematicalmichael commented May 25, 2023

FWIW, I took a stab at this with ChatGPT. Here's its unedited response:


The Gunicorn maintainer is proposing to use server hooks to balance GPU assignment among workers, rather than hard-coding GPU identifiers into environment variables. This approach leverages the lifecycle of Gunicorn workers to achieve load balancing.

Here is a basic example of how you might use Gunicorn server hooks to balance GPU assignment among workers:

import os
import heapq

gpu_heap = list(range(0, num_gpus))  # replace num_gpus with the number of GPUs

def pre_fork(server, worker):
    # Grab a GPU identifier from the heap before a worker is forked
    if gpu_heap:
        gpu_id = heapq.heappop(gpu_heap)
        worker.gpu_id = gpu_id

def post_fork(server, worker):
    # Set the GPU identifier as an environment variable after a worker is forked
    os.environ['CUDA_VISIBLE_DEVICES'] = str(worker.gpu_id)

def worker_exit(server, worker):
    # When a worker exits, return its GPU identifier to the heap
    heapq.heappush(gpu_heap, worker.gpu_id)
    del os.environ['CUDA_VISIBLE_DEVICES']  # Optionally clear the environment variable

# Then in your Gunicorn config file, add:
# worker_class = 'path.to.your.worker.class'
# on_starting = pre_fork
# post_fork = post_fork
# worker_exit = worker_exit

This script will make use of Python's built-in heap data structure (implemented as a binary heap), which is essentially a priority queue that will always give the minimum value (in this case, the lowest GPU identifier) when popped.

Please note that the workers do not directly share memory, so the gpu_heap in this example would need to be shared across processes or stored in a database that is accessible by all worker processes.

Moreover, this is a high-level example and your real-world use case may require additional complexity. For instance, handling cases when there are more workers than GPU devices, managing GPU memory usage more closely, or handling errors in GPU assignment.

Finally, please be aware that os.environ changes will not propagate to child processes after they have been forked, so setting the CUDA_VISIBLE_DEVICES variable in the post_fork function will only affect the worker process in which it is set. It's usually more straightforward to set this environment variable before starting the Gunicorn server. The example provided is for illustrating the concept.


So I asked it to try addressing those points...


The major challenge is that Gunicorn workers don't directly share memory, and Python multiprocessing utilities like multiprocessing.Value and multiprocessing.Array aren't designed for non-numeric and non-array data structures, like heap.

We can address this limitation by using a third-party library such as Redis, which is an in-memory data structure store. It will act as our shared memory for the GPU heap. Here's an improved script that uses Redis as the shared memory and provides handling for more edge cases:

import os
import redis
from heapq import heappush, heappop

# Connect to Redis server. Make sure the Redis server is running.
r = redis.Redis(host='localhost', port=6379, db=0)

# Initialize GPU heap in Redis.
num_gpus = 4  # replace this with the number of your GPUs
gpu_heap_key = 'gpu_heap'
if not r.exists(gpu_heap_key):
    r.rpush(gpu_heap_key, *range(num_gpus))

def pre_fork(server, worker):
    # Grab a GPU identifier from the heap before a worker is forked.
    # If no GPU is available, assign None.
    worker.gpu_id = None
    if r.llen(gpu_heap_key) > 0:
        worker.gpu_id = int(r.lpop(gpu_heap_key))

def post_fork(server, worker):
    # Set the GPU identifier as an environment variable after a worker is forked.
    # If no GPU is available, this will not set the environment variable.
    if worker.gpu_id is not None:
        os.environ['CUDA_VISIBLE_DEVICES'] = str(worker.gpu_id)

def worker_exit(server, worker):
    # When a worker exits, return its GPU identifier to the heap.
    # If the worker did not have a GPU, this does nothing.
    if worker.gpu_id is not None:
        r.rpush(gpu_heap_key, worker.gpu_id)
        del os.environ['CUDA_VISIBLE_DEVICES']  # Optionally clear the environment variable

# Then in your Gunicorn config file, add:
# worker_class = 'path.to.your.worker.class'
# on_starting = pre_fork
# post_fork = post_fork
# worker_exit = worker_exit

This script connects to a local Redis server and uses a Redis list as the GPU heap. It pops a GPU ID from the list when a worker is forked, and pushes the GPU ID back into the list when the worker exits.

If there are no GPUs available in the heap, the worker's gpu_id attribute is set to None, and the CUDA_VISIBLE_DEVICES environment variable is not set.

Please note that you need to install the redis Python package (with pip install redis or conda install redis) and have a running Redis server to use this script.

In addition, the workers might need additional configuration to use the specific GPU based on the 'CUDA_VISIBLE_DEVICES' variable in their own context.

Again, this is still a simplified example. In a production environment, you might need to handle additional complexities, such as GPU memory management, GPU utilization balancing, error handling and more.


Hopefully this is helpful for anyone coming across this in the future. I certainly learned quite a lot looking back on this PR and trying to understand Server Hooks.

tilgovi (Collaborator) commented Dec 28, 2023

Hah. That's very cool! I think the first example is maybe correct, and ChatGPT expressed too much doubt here. Using the pre_fork hook means the priority queue is in arbiter memory. Passing the value on the worker instance itself is a great way to get the information from the arbiter into the worker. I think the environment variable is never set in the arbiter, so doesn't need to be cleared, but otherwise ChatGPT got it right 😆.
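
Pulling those notes together, here is a consolidated, untested sketch. It keeps all bookkeeping in arbiter memory, returns the GPU slot in child_exit (which runs in the arbiter, unlike worker_exit), skips the environment-variable cleanup, and handles more workers than GPUs by tracking per-GPU load. The num_gpus value is a placeholder.

# gunicorn.conf.py (untested sketch; num_gpus is a placeholder)
import os

num_gpus = 2  # set to the number of GPUs on the host

# Number of live workers currently assigned to each GPU (arbiter memory).
_gpu_load = {gpu: 0 for gpu in range(num_gpus)}

def pre_fork(server, worker):
    # Arbiter: pick the least-loaded GPU for the worker about to be forked.
    gpu = min(_gpu_load, key=_gpu_load.get)
    _gpu_load[gpu] += 1
    worker.gpu_id = gpu

def post_fork(server, worker):
    # Worker: pin this process to its GPU. Assumes the app (and CUDA) is
    # loaded per worker, i.e. preload_app is not enabled.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker.gpu_id)

def child_exit(server, worker):
    # Arbiter: release the GPU slot when the worker exits, so replacement
    # workers stay balanced.
    gpu = getattr(worker, "gpu_id", None)
    if gpu is not None:
        _gpu_load[gpu] -= 1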

mathematicalmichael

😆 What a wild ride it's been, eh?
I think this was one of the earliest examples of me genuinely learning something new with it, as this issue stuck around in my mind for years and I re-encountered it while serving a stable-diffusion app.
