Description
Problem: How do I run ds_pretrain_gpt2-zero3.sh in a Docker environment using 2 nodes?
Current platform:
- I have 2 nodes (100.3.8.100 & 100.3.8.68), each with 8 V100 GPUs. The Docker environments are all set up and I can run ds_pretrain_gpt2-zero3.sh locally on each node successfully using the following command:
docker run -it --gpus all -v /home/libn/DeepSpeedExamples:/workspace deepspeed:latest
- I installed pdsh on each node:
For 100.3.8.100, I installed pdsh in the Docker environment (deepspeed:latest).
For 100.3.8.68, I installed pdsh in the Docker environment (deepspeed:latest).
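For reference, installing pdsh inside the container is typically just a package install (this assumes the deepspeed:latest image is Debian/Ubuntu-based, which the thread doesn't state):

```shell
# Inside each container: install pdsh (DeepSpeed's default multi-node launcher)
apt-get update && apt-get install -y pdsh

# Verify the install by printing version information
pdsh -V
```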
My modification:
I want to use 2 nodes, so I modified ds_pretrain_gpt2-zero3.sh:
- change
run_cmd
to:
run_cmd="deepspeed --hostfile=myhostfile pretrain_gpt2.py ${@:2} ${full_options}"
- add myhostfile:
100.3.8.100 slots=2
100.3.8.68 slots=2
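For reference, a minimal sketch of the hostfile format described above, plus a quick slot-count sanity check (the awk one-liner is my own illustration, not part of the original script); note this hostfile exposes only 2 of each node's 8 GPUs:

```shell
# Recreate the hostfile: one "<ip> slots=<num_gpus>" line per node
cat > myhostfile <<'EOF'
100.3.8.100 slots=2
100.3.8.68 slots=2
EOF

# Illustrative sanity check: total worker slots DeepSpeed will schedule
total=$(awk -F'slots=' '{s += $2} END {print s}' myhostfile)
echo "total slots: $total"
```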
My test and error:
Test: on the 100.3.8.100 server, inside the Docker env (deepspeed:latest):
docker run -it --gpus all -v /home/libn/DeepSpeedExamples:/workspace deepspeed:latest
cd megatron/Megatron-LM-v1.1.5-ZeRO3/exmaples
bash ds_pretrain_gpt2-zero3.sh
Error:
100.3.8.100: bash: line 0: cd: /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/exmaples: No such file or directory
100.3.8.100: bash: /opt/conda/bin/python: No such file or directory
100.3.8.68: bash: line 0: cd: /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/exmaples: No such file or directory
100.3.8.68: bash: /opt/conda/bin/python: No such file or directory
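Both missing paths (/workspace and /opt/conda/bin/python) exist only inside the container, which suggests the launcher is landing somewhere they are not mounted. A tiny illustrative helper (my own sketch, not from the thread) for checking which required paths exist wherever the launch command actually runs:

```shell
# Illustrative helper: report which required paths exist in the environment
# where this runs (host vs. container matters for this error!)
check_paths() {
  for p in "$@"; do
    if [ -e "$p" ]; then
      echo "OK: $p"
    else
      echo "MISSING: $p"
    fi
  done
}

# In this thread's setup you would check /workspace and /opt/conda/bin/python;
# the demo below uses /tmp and a deliberately bogus path.
check_paths /tmp /no/such/path
```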
Any suggestions? Thanks!
Activity
(Issue retitled: [Question] How to launch jobs with Docker env using multiple nodes in DeepSpeed?)
tjruwase commented on Mar 2, 2023
@bing0037, did you verify that those paths are valid in both containers?
bing0037 commented on Mar 3, 2023
Thanks for your reply.
Here are all my tests and results:
Test 1: test the Docker containers on node1 (100.3.8.100) and node2 (100.3.8.68):
node1:
Result:
node2: the code directory on the host is different from node1's, but it is the same inside the Docker env.
Result:
Test 2: training using 2 nodes
On node1 (100.3.8.100), run ds_pretrain_gpt2-zero3.sh:
hostfile:
Result:
bing0037 commented on Mar 7, 2023
@tjruwase Hi, do you have any suggestions?
zincnode commented on Mar 8, 2023
Hi, @bing0037
I did something similar before (testing ZeRO in a multi-node Docker env). Here are some thoughts and suggestions:
The server (outside the container, hereinafter: Host) and the Docker env (inside the container, hereinafter: Container) can be treated as two completely independent environments; you can even think of the Container as a virtual machine. This is imprecise, but easier to understand.
I guess the IPs you mentioned (100.3.8.XXX) are Host IPs. When you start a Docker container without specifying the network parameter (--net), it uses the Bridge network by default. At that point there are two IPs: the Host IP (100.3.8.XXX) and the Container IP (typically 172.XXX.XXX.XXX). The two IPs point to the two independent environments mentioned above. So when you write the DeepSpeed hostfile, you should actually use the Container IPs instead of the Host IPs. When you use the Host IPs, DeepSpeed establishes an ssh connection to your Host and tries to read files and execute scripts there, but the Host has no /workspace and no /opt/conda/bin/python (those paths exist only in the Container), which is exactly your error message. However, simply putting the Container IPs in the hostfile still won't work; read on.
As you know, distributed training with DeepSpeed uses ssh, which requires the nodes to be able to reach each other (in short, to ping each other). Since you start the containers on the Bridge network, the two containers cannot ping each other (both use 172.XXX.XXX.XXX addresses on separate hosts). There are roughly two solutions:
A typical GPU cluster uses InfiniBand to accelerate communication between nodes; I'm not sure whether the cluster you're using has it. If it does, add --privileged when starting the container. This makes InfiniBand available inside the container. This worked for me, though of course it requires InfiniBand to be available on the Host.
FYI, this is how I use Docker while testing:
- Start the container
- Enter the container:
docker exec -it AAAAA bash
- Exit the container (the container keeps running after you exit, unless you run docker container stop AAAAA):
exit
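That workflow can be sketched end to end as follows. The container name AAAAA is kept from the comment; the --net=host and --privileged flags and the `sleep infinity` keep-alive are my own assumptions for a multi-node setup, not commands from the thread:

```shell
# Start the container detached, sharing the host's network stack so the two
# nodes can reach each other directly (sidesteps the Bridge-network IP issue);
# --privileged exposes InfiniBand devices if the Host has them.
docker run -d --name AAAAA --gpus all --net=host --privileged \
  -v /home/libn/DeepSpeedExamples:/workspace \
  deepspeed:latest sleep infinity

# Enter the container
docker exec -it AAAAA bash

# Exit the interactive shell (the container keeps running) ...
exit

# ... and stop the container when you are done
docker container stop AAAAA
```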
The above thoughts are based on the information you provided and my experience, there may be inaccuracies or mistakes.😂
Additional materials:
bnuzhanyu commented on Mar 9, 2023
Thanks for your answer; it seems like a reasonable way to solve the problem. But I wonder: is there any way to run the commands (differing in rank/role/port) in every Docker env directly, so that they start communicating without ssh being required, like TensorFlow's distributed training? That way I could run the distributed training on Kubernetes.
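One common ssh-free pattern (my own suggestion, not confirmed in this thread) is to launch the same command on every node yourself and let torch.distributed rendezvous over TCP, since DeepSpeed picks up the standard torch.distributed environment variables. A sketch, where the master address is a hypothetical container-reachable IP:

```shell
# Run this on every node, changing only NODE_RANK (0 on the first node,
# 1 on the second). MASTER_ADDR must be reachable from every container,
# e.g. via host networking or an overlay network.
NNODES=2
NODE_RANK=0            # set to 1 on the second node
MASTER_ADDR=10.0.0.1   # hypothetical address of the rank-0 node

torchrun --nnodes=$NNODES --nproc_per_node=8 \
         --node_rank=$NODE_RANK \
         --master_addr=$MASTER_ADDR --master_port=29500 \
         pretrain_gpt2.py ${full_options}
```

Because each node starts its own workers, no launcher needs to ssh anywhere, which maps naturally onto one pod per node in Kubernetes.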
loadams commented on Aug 21, 2023
This seems like the best answer so far, and the issue is fairly stale, so I'm closing it for now. If folks have other suggestions, please post them here; if you have other questions, please open a new issue so we can see it and reply. Thanks!
ray-008 commented on Nov 23, 2023
The official Docker documentation describes creating an overlay network:
use-an-overlay-network-for-standalone-containers
Then the two containers can communicate across machines through this network; what remains is passwordless ssh login, environment configuration, etc.
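A sketch of that overlay-network setup (the network name ds-net is my own placeholder, and the join token is elided because it is printed per-cluster by `docker swarm init`):

```shell
# On node1 (the manager): initialize swarm mode; this prints a join command
# containing a token for the other nodes.
docker swarm init --advertise-addr 100.3.8.100

# On node2: join the swarm using the token printed above (placeholder here).
docker swarm join --token <TOKEN> 100.3.8.100:2377

# On node1: create an attachable overlay network so standalone containers
# (not just swarm services) can connect to it.
docker network create --driver overlay --attachable ds-net

# On each node: start the training container attached to the overlay network.
docker run -it --gpus all --net ds-net \
  -v /home/libn/DeepSpeedExamples:/workspace deepspeed:latest
```

Containers on ds-net can then resolve and reach each other across machines, so the DeepSpeed hostfile can list their overlay-network addresses.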