Double GPU memory consumption when using the PyTorch built-in AMP module, starting from the second epoch #610

Closed · tkianai opened this issue Aug 3, 2020 · 10 comments · Labels: bug

Comments

tkianai (Contributor) commented Aug 3, 2020

I encountered this problem while running python -m torch.distributed.launch --nproc_per_node 8 train.py --batch-size 128 --data coco.yaml --cfg yolov5s.yaml --weights '' with the latest version of the code.

The printed logs are as follows:

Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     0/299     3.54G    0.0874    0.0744     0.081    0.2428        10       640: 100%|█████████████████| 925/925 [05:45<00:00,  2.68it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100%|███████| 40/40 [01:31<00:00,  2.28s/it]
                 all       5e+03    3.63e+04      0.0336     0.00589     0.00874     0.00232

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     1/299     8.08G   0.07261    0.0743   0.06651    0.2134        37       640: 100%|█████████████████| 925/925 [05:41<00:00,  2.71it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100%|███████| 40/40 [01:18<00:00,  1.95s/it]
                 all       5e+03    3.63e+04       0.143      0.0414      0.0393      0.0145

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     2/299     8.08G   0.06912   0.06453   0.06019    0.1938       211       640:   2%|| 15/925 [00:10<05:50,  2.60it/s]

The GPU state is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:04:00.0 Off |                  N/A |
| 43%   72C    P2   187W / 250W |   8409MiB / 12196MiB |     69%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 49%   80C    P2   207W / 250W |   4083MiB / 12196MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:07:00.0 Off |                  N/A |
| 48%   78C    P2   235W / 250W |   4083MiB / 12196MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:08:00.0 Off |                  N/A |
| 51%   83C    P2   252W / 250W |   4083MiB / 12196MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
|   4  TITAN Xp            Off  | 00000000:0C:00.0 Off |                  N/A |
| 49%   80C    P2   260W / 250W |   4083MiB / 12196MiB |     76%      Default |
+-------------------------------+----------------------+----------------------+
|   5  TITAN Xp            Off  | 00000000:0D:00.0 Off |                  N/A |
| 40%   68C    P2   235W / 250W |   4083MiB / 12196MiB |     73%      Default |
+-------------------------------+----------------------+----------------------+
|   6  TITAN Xp            Off  | 00000000:0E:00.0 Off |                  N/A |
| 49%   81C    P2   225W / 250W |   4083MiB / 12196MiB |     77%      Default |
+-------------------------------+----------------------+----------------------+
|   7  TITAN Xp            Off  | 00000000:0F:00.0 Off |                  N/A |
| 52%   84C    P2   138W / 250W |   4083MiB / 12196MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+
tkianai added the bug label Aug 3, 2020
github-actions bot commented Aug 3, 2020

Hello @tkianai, thank you for your interest in our work! Please visit our Custom Training Tutorial to get started, and see our Jupyter Notebook (Open in Colab), Docker Image, and Google Cloud Quickstart Guide for example environments.

If this is a bug report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom model or data training question, please note Ultralytics does not provide free personal support. As a leader in vision ML and AI, we do offer professional consulting, from simple expert advice up to delivery of fully customized, end-to-end production solutions for our clients, such as:

  • Cloud-based AI systems operating on hundreds of HD video streams in realtime.
  • Edge AI integrated into custom iOS and Android apps for realtime 30 FPS video inference.
  • Custom data training, hyperparameter evolution, and model exportation to any destination.

For more information please visit https://www.ultralytics.com.

glenn-jocher (Member) commented

@tkianai I can't comment on this much as I don't often use multi-GPU, but I believe excess device 0 memory consumption has been common for a while. I don't know if this is specific to this repository or to PyTorch DDP in general.

I would probably not classify this as a bug.
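
For context (general PyTorch DDP knowledge, not specific to this repository): excess memory on device 0 often comes from every worker process accidentally allocating something on cuda:0, for example a checkpoint loaded without map_location, or CUDA work done before the process pins itself to its local GPU. A minimal sketch of the two usual mitigations, with placeholder names:

```python
# Minimal sketch, not this repo's code; 'weights.pt' is a placeholder path.
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # set by torch.distributed.launch
args = parser.parse_args()

# 1) Pin each worker to its own GPU before creating any CUDA tensors,
#    so stray .cuda() calls land on the local device instead of cuda:0.
torch.cuda.set_device(args.local_rank)

# 2) When resuming from a checkpoint, map it to the local device (or CPU);
#    torch.load without map_location restores tensors onto the device they
#    were saved from, typically cuda:0, in every worker.
ckpt = torch.load('weights.pt', map_location=f'cuda:{args.local_rank}')
```

Whether either applies here is unverified; the run above starts from --weights '', so no checkpoint is loaded.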

tkianai (Contributor, Author) commented Aug 4, 2020

@glenn-jocher Hi, thanks for your explanation. I just find that AMP (the PyTorch built-in version) does not help much: it's slower, and it saves only a little GPU memory.

My environment: Ubuntu 18.04, PyTorch 1.6.0, CUDA 10.2, TITAN Xp.

Compared to commit 48e15be498ab6f3335743b5df7fbd02edd068160, which does not use mixed-precision training:

0/299     1.77G   0.08957   0.08007   0.08078    0.2504        23       640: 100%|██████████████| 925/925 [05:11<00:00,  2.97it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100%|████| 40/40 [01:25<00:00,  2.14s/it]
                 all       5e+03    3.63e+04      0.0609     0.00288     0.00829     0.00234

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     1/299     8.31G    0.0737   0.07986   0.06731    0.2209        21       640: 100%|██████████████| 925/925 [05:01<00:00,  3.07it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100%|████| 40/40 [01:20<00:00,  2.00s/it]
                 all       5e+03    3.63e+04       0.126      0.0581      0.0457      0.0168

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     2/299     8.36G   0.07015   0.07251   0.06002    0.2027       188       640:   3%|| 26/925 [00:09<04:55,  3.04it/s]
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:04:00.0 Off |                  N/A |
| 45%   74C    P2   231W / 250W |   8609MiB / 12196MiB |     73%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 51%   82C    P2    95W / 250W |   5413MiB / 12196MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:07:00.0 Off |                  N/A |
| 49%   79C    P2    95W / 250W |   5423MiB / 12196MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:08:00.0 Off |                  N/A |
| 52%   83C    P2   261W / 250W |   5415MiB / 12196MiB |     75%      Default |
+-------------------------------+----------------------+----------------------+
|   4  TITAN Xp            Off  | 00000000:0C:00.0 Off |                  N/A |
| 50%   80C    P2   100W / 250W |   5415MiB / 12196MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   5  TITAN Xp            Off  | 00000000:0D:00.0 Off |                  N/A |
| 41%   70C    P2   240W / 250W |   5413MiB / 12196MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+
|   6  TITAN Xp            Off  | 00000000:0E:00.0 Off |                  N/A |
| 50%   81C    P2    90W / 250W |   5413MiB / 12196MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+
|   7  TITAN Xp            Off  | 00000000:0F:00.0 Off |                  N/A |
| 52%   84C    P2   249W / 250W |   5413MiB / 12196MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

glenn-jocher (Member) commented

@tkianai oh that's odd. In our experiments we found native AMP with 1.6 to use a little less GPU RAM, which allowed us to increase batch size by maybe 10%, giving a bit of a speedup (maybe 5%-10%). We have not tested on multi-GPU, however.

It looks like you see a similar GPU memory reduction to what we saw; perhaps this would allow you to increase your batch size somewhat. I don't know if there is anything you can do about the device 0 memory usage, but that seems to be a separate issue, which may or may not have a solution. You might want to raise an issue on the pytorch repo for that one.
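
For reference, "native AMP" here means the torch.cuda.amp utilities added in PyTorch 1.6. A minimal sketch of the usual autocast + GradScaler pattern (generic, not this repo's exact training loop; the model and data are stand-ins):

```python
import torch

model = torch.nn.Linear(10, 1).cuda()                 # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()                  # scales the loss to avoid FP16 gradient underflow

for x, y in [(torch.randn(4, 10).cuda(), torch.randn(4, 1).cuda())]:  # stand-in data
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                   # forward pass runs in mixed precision
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                     # backward on the scaled loss
    scaler.step(optimizer)                            # unscales grads; skips the step on inf/nan
    scaler.update()
```

As an aside (general hardware knowledge, not from this thread): FP16 speedups come mostly from Tensor Cores on Volta and newer GPUs; Pascal cards such as the TITAN Xp lack them, which may explain why AMP trains slower in the logs above.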

NanoCode012 (Contributor) commented Aug 4, 2020

A bit of excess memory on device 0 should be expected (<1-2 GB). I think this is related to testing being performed only on GPU 0.

I did my own quick test using your command. Numbers taken at the second epoch.

| Branch | GPU 0 (GB) | GPU 1-7 (GB) |
| --- | --- | --- |
| My master branch (Pre 1.6) | 8 | 4 |
| My muti branch (Post 1.6) | 10 | 4 |

@tkianai, I recommend using --sync-bn since the results aren't looking too good.

Edit: See below comment for better visuals. The above table may be skewed.
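
For reference, a minimal sketch of how SyncBatchNorm is typically enabled before wrapping a model in DDP (generic PyTorch pattern, not necessarily what --sync-bn does internally in this repo):

```python
import torch

# Stand-in model already moved to the local device.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8)).cuda()

# Replace every BatchNorm*d layer with SyncBatchNorm so batch statistics are
# computed across all DDP processes instead of per-GPU mini-batches.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Then wrap in DDP as usual (requires an initialized process group):
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```

With a per-GPU batch of 16 (128 split across 8 GPUs here), per-GPU BatchNorm statistics can get noisy, which would be consistent with the mAP drop reported above.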

glenn-jocher (Member) commented

One minor cause is the EMA, which is resident on device 0, but this should only be a few hundred MB at most. For v5x, with 90M parameters, even in FP32 that would be 4 bytes * 90M = 360 MB. The rest I'm not really sure about.

Testing runs on device 0, but once the test.py function exits, all the CUDA variables there should be cleared and no longer consume memory. If @tkianai is saying that epoch 0 is different from epoch >0, though, then the call to test.py could be the differentiator. In that sense it is suspect.
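
The 4 bytes * 90M estimate above generalizes to any model; a small sketch (hypothetical helper, not from the repo) for estimating the extra memory an FP32 EMA copy adds:

```python
import torch

def ema_memory_mb(model: torch.nn.Module) -> float:
    """Approximate size of an FP32 EMA copy of `model`, in MB (4 bytes per parameter)."""
    n_params = sum(p.numel() for p in model.parameters())
    return n_params * 4 / 1e6

# For a ~90M-parameter model such as v5x this gives ~360 MB,
# matching the back-of-the-envelope figure above.
```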

tkianai (Contributor, Author) commented Aug 4, 2020

@NanoCode012 Yes, multi-GPU training with the same 300 epochs causes a decline of almost ~0.5 mAP. I only tested this case when training multi-GPU with AMP. --sync-bn would be a good way to improve performance.

NanoCode012 (Contributor) commented Aug 4, 2020

Testing runs on device 0, but once the test.py function exits, all the CUDA variables there should be cleared and no longer consume memory.

Oh I see!

If @tkianai is saying that epoch 0 is different from epoch >0, though, then the call to test.py could be the differentiator. In that sense it is suspect.

On 1.6,

[Screenshots of nvidia-smi output, one per cell: Optimizer | Epoch 1 Train | Epoch 1 Test | Epoch 2 Train | Epoch 2 Test | Epoch 3 Train | Epoch 3 Test]

@glenn-jocher, hope this provides better visualization. @tkianai, do you get the same results?

Edit: Add table

tkianai (Contributor, Author) commented Aug 4, 2020

@glenn-jocher @NanoCode012 Hi, it seems the test stage increases the GPU memory on device 0. I used the --notest flag and this no longer happens from epoch 1.

  • Using AMP (PyTorch built-in)

python -m torch.distributed.launch --nproc_per_node 8 train.py --batch-size 128 --data coco.yaml --cfg yolov5s.yaml --weights '' --notest

The printed logs are as follows:

Starting training for 300 epochs...

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     0/299      2.9G   0.08759    0.0743   0.08098    0.2429        10       640: 100%|██████████████████| 925/925 [06:14<00:00,  2.47it/s]

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     1/299     2.94G   0.07263   0.07433   0.06655    0.2135         9       640: 100%|██████████████████| 925/925 [06:03<00:00,  2.55it/s]

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     2/299     2.94G   0.06974   0.07265    0.0598    0.2022       153       640:   7%|█▍                 | 69/925 [00:28<05:34,  2.56it/s]
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:04:00.0 Off |                  N/A |
| 45%   72C    P2   214W / 250W |   3457MiB / 12196MiB |     79%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 53%   83C    P2   206W / 250W |   3423MiB / 12196MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:07:00.0 Off |                  N/A |
| 51%   80C    P2   232W / 250W |   3423MiB / 12196MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:08:00.0 Off |                  N/A |
| 53%   83C    P2   213W / 250W |   3423MiB / 12196MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   4  TITAN Xp            Off  | 00000000:0C:00.0 Off |                  N/A |
| 52%   82C    P2   128W / 250W |   3423MiB / 12196MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   5  TITAN Xp            Off  | 00000000:0D:00.0 Off |                  N/A |
| 43%   69C    P2    96W / 250W |   3423MiB / 12196MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+
|   6  TITAN Xp            Off  | 00000000:0E:00.0 Off |                  N/A |
| 53%   83C    P2   157W / 250W |   3423MiB / 12196MiB |     82%      Default |
+-------------------------------+----------------------+----------------------+
|   7  TITAN Xp            Off  | 00000000:0F:00.0 Off |                  N/A |
| 53%   83C    P2   231W / 250W |   3423MiB / 12196MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
  • Using float32

python -m torch.distributed.launch --nproc_per_node 8 train.py --batch-size 128 --data coco.yaml --cfg yolov5s.yaml --weights '' --notest

The printed logs are as follows:

Starting training for 300 epochs...

     0/299     1.77G   0.08925   0.08011    0.0807    0.2501        23       640: 100%|██████████████████| 925/925 [05:16<00:00,  2.93it/s]

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     1/299      5.2G   0.07356   0.07901   0.06713    0.2197        33       640: 100%|██████████████████| 925/925 [05:06<00:00,  3.01it/s]

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
     2/299      5.2G   0.07061   0.07835   0.06183    0.2108       222       640:  10%|█▊                 | 89/925 [00:32<04:31,  3.08it/s]
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:04:00.0 Off |                  N/A |
| 46%   73C    P2   166W / 250W |   5637MiB / 12196MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:06:00.0 Off |                  N/A |
| 53%   83C    P2   211W / 250W |   5603MiB / 12196MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:07:00.0 Off |                  N/A |
| 51%   80C    P2   212W / 250W |   5413MiB / 12196MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:08:00.0 Off |                  N/A |
| 53%   83C    P2   194W / 250W |   5413MiB / 12196MiB |     64%      Default |
+-------------------------------+----------------------+----------------------+
|   4  TITAN Xp            Off  | 00000000:0C:00.0 Off |                  N/A |
| 52%   82C    P2   259W / 250W |   5415MiB / 12196MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   5  TITAN Xp            Off  | 00000000:0D:00.0 Off |                  N/A |
| 44%   70C    P2   221W / 250W |   5413MiB / 12196MiB |     77%      Default |
+-------------------------------+----------------------+----------------------+
|   6  TITAN Xp            Off  | 00000000:0E:00.0 Off |                  N/A |
| 53%   82C    P2   263W / 250W |   5413MiB / 12196MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   7  TITAN Xp            Off  | 00000000:0F:00.0 Off |                  N/A |
| 57%   84C    P2   285W / 250W |   5413MiB / 12196MiB |     83%      Default |
+-------------------------------+----------------------+----------------------+

It reduces the GPU memory from ~5400 MiB to ~3400 MiB, which works well. But it looks like training takes a little longer (5 minutes to 6 minutes per epoch).
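
If the extra device 0 memory really does come from the test pass, one generic mitigation (an assumption to try, not something verified in this thread) is to drop references to test-time tensors and return PyTorch's cached blocks to the driver once evaluation finishes, so nvidia-smi reflects the release:

```python
import gc
import torch

def run_test_then_release(test_fn, *args, **kwargs):
    """Run an evaluation routine, then release cached CUDA memory.

    `test_fn` is a placeholder for whatever per-epoch evaluation is called
    (e.g. this repo's test.py entry point).
    """
    with torch.no_grad():          # no autograd graphs are kept during eval
        results = test_fn(*args, **kwargs)
    gc.collect()                   # drop lingering Python references to eval tensors
    torch.cuda.empty_cache()       # return cached blocks so nvidia-smi shows the drop
    return results
```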

glenn-jocher (Member) commented Aug 6, 2020

@tkianai @NanoCode012 I'm updating this issue with comments from a recent PR.

For the GPU memory issue that was mentioned in #610, tkianai and I did test and found out that it spiked after the epoch 1 test (#610 (comment) and #610 (comment)). Specifically, the issue lies in test, as --notest removes the issue.

I am not sure why it does not happen for you. Maybe it is because your maximum memory is 16 GB and it is already training at 14 GB (near its maximum)?

I will run my own test with 2 GPU as comparison. I think this should be in a separate Issue.

Edit: Add table

Command

python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 64 --data coco.yaml --cfg yolov5s.yaml --weights ''

| GPU | Train 1 | Train 3 | Test 5 |
| --- | --- | --- | --- |
| GPU 0 | 6145MiB | 6483MiB | 6483MiB |
| GPU 1 | 5965MiB | 6295MiB | 6295MiB |

Edit 2: Due to losing track of time, I did not get to track down the in-betweens; however, we see that it doesn't spike as badly. I am confused why it spiked before. Was it because of 8 GPUs?

Originally posted by @NanoCode012 in #500 (comment)

And my own observation was no additional GPU memory consumption on device 0 when using 2-GPU training, during either training or testing, on epoch 0 or any subsequent epochs:

nvidia-smi from epoch 0 onward, to test the high device 0 memory usage issue.
python -m torch.distributed.launch --nproc_per_node 2 train.py --img 640 640 --batch 32 --cfg yolov5x.yaml --weights '' --bucket ult/coco --name $n --data coco.yaml

Epoch 0 train:

glenn_jocher_ultralytics_com@instance-7:~$ nvidia-smi
Thu Aug  6 17:50:26 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P0   210W / 300W |  15020MiB / 16130MiB |     84%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   54C    P0   223W / 300W |  14708MiB / 16130MiB |     88%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1806      C   /opt/conda/bin/python                      15009MiB |
|    1      1807      C   /opt/conda/bin/python                      14697MiB |
+-----------------------------------------------------------------------------+

Epoch 0 test:

glenn_jocher_ultralytics_com@instance-7:~$ nvidia-smi
Thu Aug  6 18:19:45 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P0    84W / 300W |  14172MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   40C    P0    56W / 300W |  14006MiB / 16130MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1806      C   /opt/conda/bin/python                      14159MiB |
|    1      1807      C   /opt/conda/bin/python                      13995MiB |
+-----------------------------------------------------------------------------+

Epoch 1 train:

glenn_jocher_ultralytics_com@instance-7:~$ nvidia-smi
Thu Aug  6 18:21:59 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P0   277W / 300W |  14996MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   55C    P0   263W / 300W |  14600MiB / 16130MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1806      C   /opt/conda/bin/python                      14981MiB |
|    1      1807      C   /opt/conda/bin/python                      14589MiB |
+-----------------------------------------------------------------------------+

Originally posted by @glenn-jocher in #500 (comment)
