[Question] How to launch jobs with Docker env using multiple nodes in DeepSpeed? #2920
Comments
@bing0037, did you verify that those paths are valid in both containers?
Thanks for your reply. Here are all my tests and results:
Test 1: test the Docker containers on node1 (100.3.8.100) and node2 (100.3.8.68):
node1:
Result:
node2: the code directory on the host differs from node1's, but the path inside the Docker environment is the same.
Result:
Test 2: training using 2 nodes
On node1 (100.3.8.100), run ds_pretrain_gpt2-zero3.sh:
Result:
@tjruwase hi, do you have any suggestions?
Hi, @bing0037
The above thoughts are based on the information you provided and my own experience, so there may be inaccuracies or mistakes. 😂 Additional materials:
Thanks for your answer; it seems a reasonable way to solve the problem. But I wonder whether there is a way to run a command (differing in rank/role/port) in each Docker environment and have them start communicating on their own, so that ssh is not required, as in TensorFlow's distributed training. That way I could run the distributed training in Kubernetes.
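To make the "one command per container, differing only in rank" idea concrete, here is a minimal sketch of the per-process settings such a launch would need. It relies on the standard torch.distributed environment variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, LOCAL_RANK); whether this works for this particular script was not confirmed in this thread. The helper name `rank_env`, the 8 GPUs per node, and port 29500 are assumptions for illustration only.

```python
def rank_env(node_rank, local_rank, num_nodes, gpus_per_node,
             master_addr, master_port=29500):
    """Build the torch.distributed env vars for one worker process.

    Hypothetical helper: every container calls it with the same
    master_addr/master_port and its own node_rank, so no ssh is involved.
    """
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank out of range")
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank out of range")
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # Total number of processes across all nodes.
        "WORLD_SIZE": str(num_nodes * gpus_per_node),
        # Global rank: processes on node 0 come first, then node 1, ...
        "RANK": str(node_rank * gpus_per_node + local_rank),
        "LOCAL_RANK": str(local_rank),
    }


# ASSUMPTION: 2 nodes with 8 GPUs each; node1 (100.3.8.100) acts as master.
env = rank_env(node_rank=1, local_rank=0, num_nodes=2,
               gpus_per_node=8, master_addr="100.3.8.100")
print(env)
```

Each container would merge this dictionary into its process environment before starting its local workers. In Kubernetes the node rank would typically come from the pod's index rather than being typed by hand.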
This seems like the best answer so far, and the issue is fairly stale. Closing for now. If folks have other suggestions, please post them here; if you have other questions, please open a new issue so we can see it and reply. Thanks!
The official documentation describes creating an overlay network. The two containers can then communicate across machines through this network; what remains is passwordless login, environment configuration, etc.
Problem: How to run ds_pretrain_gpt2-zero3.sh in docker env using 2 nodes?
Current platform:
On 100.3.8.100, I installed pdsh in the Docker environment (deepspeed:latest):
On 100.3.8.68, I installed pdsh in the Docker environment (deepspeed:latest).
My modification:
I want to use 2 nodes, so I modified ds_pretrain_gpt2-zero3.sh:
run_cmd
to: run_cmd="deepspeed --hostfile=myhostfile pretrain_gpt2.py ${@:2} ${full_options}"
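For reference, a DeepSpeed hostfile lists one node per line in the form `<hostname> slots=<number of GPUs>`. The contents of myhostfile are not shown in this issue, so the following is only a sketch of what it might look like for the two nodes above. The helper name `render_hostfile` and the 8 slots per node are assumptions.

```python
def render_hostfile(slots_by_host):
    """Return hostfile text with one '<host> slots=<n>' line per node."""
    lines = []
    for host, slots in slots_by_host.items():
        if slots < 1:
            raise ValueError(f"{host}: slots must be at least 1")
        lines.append(f"{host} slots={slots}")
    return "\n".join(lines) + "\n"


# ASSUMPTION: 8 GPUs per node; the real GPU counts are not given in the issue.
hostfile_text = render_hostfile({"100.3.8.100": 8, "100.3.8.68": 8})
print(hostfile_text, end="")
```

Each host named in the file must be reachable over passwordless ssh from the node where `deepspeed` is launched. With containers, that means the ssh endpoint has to be the container's, not just the host machine's.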
My test and error:
Test: on the 100.3.8.100 server, inside the Docker environment (deepspeed:latest)
Error:
Any suggestions? Thanks!