This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Multi-GPU training error #735

Open
HuangQinJian opened this issue May 1, 2019 · 5 comments

Comments


HuangQinJian commented May 1, 2019

❓ Questions and Help

[image]

The code got stuck. Why?


domarps commented May 3, 2019

Seems to be an issue with apex. Did you install within a container or bare metal? Either way, this could be due to a previous install on your system.
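
One way to check for a leftover install is to ask Python where it would import apex from. The following is a minimal sketch, not an official apex diagnostic; the helper name `find_apex_locations` is made up for illustration, and it only reports import locations without judging whether the build is healthy.

```python
import importlib.util
from typing import List


def find_apex_locations() -> List[str]:
    """Return the paths Python would import `apex` from (hypothetical helper).

    An empty list means apex is not importable in this environment.
    """
    spec = importlib.util.find_spec("apex")
    if spec is None:
        return []
    # For a package, submodule_search_locations holds its directories.
    locations = spec.submodule_search_locations
    if locations:
        return list(locations)
    # For a single-file module, fall back to the file it was loaded from.
    return [spec.origin] if spec.origin else []


if __name__ == "__main__":
    found = find_apex_locations()
    if not found:
        print("apex is not importable in this environment")
    for path in found:
        print("apex would be imported from:", path)
```

If the printed path is not the environment you just installed into (for example, a system-wide site-packages rather than the container or virtualenv), an older copy may be shadowing the new one.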

HuangQinJian (Author) commented

> Seems to be an issue with apex. Did you install within a container or bare metal? Either way, this could be due to a previous install on your system.

I installed it like this:

[image]

I still do not understand the error. Could you help me?

Pluto1314 commented

Hello, when I run:
CUDA_VISIBLE_DEVICES="3,4" python -m torch.distributed.launch --nproc_per_node 2 train_net.py
I run into the following error:
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called without an active exception
How can I solve it? Thanks a lot!
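
For anyone debugging the same error, here is a minimal sketch of two sanity checks on the launch environment. It does not diagnose the cause of the illegal memory access. It relies on two facts: with `CUDA_VISIBLE_DEVICES="3,4"` each process sees the GPUs renumbered as devices 0 and 1, and `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous so the error is reported at the call that triggered it. The helper names `local_to_physical` and `debug_launch_env` are hypothetical.

```python
import os
from typing import Dict, List


def parse_visible_devices(value: str) -> List[str]:
    """Split a CUDA_VISIBLE_DEVICES string such as "3,4" into ["3", "4"]."""
    return [d.strip() for d in value.split(",") if d.strip()]


def local_to_physical(visible: str) -> Dict[int, str]:
    """Map the device index seen inside the process to the physical GPU id."""
    return {i: d for i, d in enumerate(parse_visible_devices(visible))}


def debug_launch_env(visible: str, nproc_per_node: int) -> Dict[str, str]:
    """Build an environment for a debugging run (hypothetical helper).

    Raises ValueError if more processes are requested than GPUs are visible.
    """
    devices = parse_visible_devices(visible)
    if nproc_per_node > len(devices):
        raise ValueError(
            f"nproc_per_node={nproc_per_node} exceeds {len(devices)} visible GPU(s)"
        )
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(devices)
    # Synchronous launches: slower, but the traceback points at the real call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


if __name__ == "__main__":
    print(local_to_physical("3,4"))
    env = debug_launch_env("3,4", nproc_per_node=2)
    print(env["CUDA_VISIBLE_DEVICES"], env["CUDA_LAUNCH_BLOCKING"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` when re-running the launch command above, so that any hard-coded device index in the training code can be checked against the remapped numbering.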


alongGS commented Jan 3, 2020

> Hello, when I run:
> CUDA_VISIBLE_DEVICES="3,4" python -m torch.distributed.launch --nproc_per_node 2 train_net.py
> I run into the following error:
> RuntimeError: CUDA error: an illegal memory access was encountered
> terminate called without an active exception
> How can I solve it? Thanks a lot!

Maybe a permission problem?

ErwinCheung commented

export NGPUS=4
CUDA_VISIBLE_DEVICES="0,1,2,3" python -m torch.distributed.launch --nproc_per_node=$NGPUS /path_to_maskrcnn_benchmark/tools/train_net.py --config-file "path/to/config/file.yaml" MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN images_per_gpu x 1000
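
The last argument in the command above is a formula, not a literal value: `images_per_gpu x 1000` has to be replaced by a number. A minimal sketch of the arithmetic follows, assuming `images_per_gpu` is the global batch size divided evenly across the GPUs (that derivation is an assumption, and the function name is made up for illustration).

```python
def fpn_post_nms_top_n_train(ims_per_batch: int, num_gpus: int,
                             per_image: int = 1000) -> int:
    """Compute images_per_gpu x 1000 for the launch command (hypothetical helper).

    Assumes the global batch is split evenly, so images_per_gpu is
    ims_per_batch // num_gpus.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if ims_per_batch % num_gpus != 0:
        raise ValueError("ims_per_batch must be divisible by num_gpus")
    images_per_gpu = ims_per_batch // num_gpus
    return images_per_gpu * per_image


if __name__ == "__main__":
    # 8 images in total on 4 GPUs gives 2 images per GPU.
    print(fpn_post_nms_top_n_train(ims_per_batch=8, num_gpus=4))  # prints 2000
```

With `NGPUS=4` and a total batch of 8 images, the argument would then read `MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 2000`.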
