
[WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies #1643

@ardeal

Description

@ardeal

Hi,

My environment:
Windows 10
python 3.8.5
CPU 10700K + 16GB RAM
GPU 3060Ti (8GB memory)
CUDA 11.0.3_451.82_win10
numpy 1.19.3
torch 1.7.1+cu110
torchvision 0.8.2+cu110

On the master branch, I followed the section at https://github.com/ultralytics/yolov3/wiki/Train-Custom-Data and set batch-size = 2 for my 3060 Ti (8 GB memory). I got the following issue:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "D:\code_python\har_hailiang\yolov3\train.py", line 12, in <module>
    import torch.distributed as dist
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\ProgramData\Anaconda3\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
python-BaseException

Is the issue related to CUDA or GPU memory size?

Thanks and Best Regards,
Ardeal

Activity

ardeal

ardeal commented on Jan 6, 2021

@ardeal
Author

I solved the issue by changing nw to 1.
In the code: nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of workers
If nw = 8, eight dataloader worker processes take part in the work, which needs a lot of RAM.
So it works if we decrease nw to 1.
:)
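
For reference, here is a minimal sketch of that change (illustrative only; the exact surrounding code in utils/datasets.py may differ between versions):

import os

batch_size = 2  # example value; use whatever you pass via --batch-size

# Original worker-count line from the YOLOv3 dataloader:
nw = min([os.cpu_count(), batch_size if batch_size > 1 else 0, 8])  # number of workers

# Workaround: cap the workers at 1 so fewer subprocesses re-import torch
# (each worker re-loads the CUDA DLLs and commits more virtual memory).
nw = min(nw, 1)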

github-actions

github-actions commented on Feb 6, 2021

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tufail117

tufail117 commented on Feb 20, 2021

@tufail117

Any update on this? I am also facing the same issue. Have tried many things for the last 3 days, but no success.

tufail117

tufail117 commented on Feb 20, 2021

@tufail117

Well, I managed to resolve this:
Open "Advanced system settings", go to the Advanced tab, then click Settings under Performance.
Again click the Advanced tab --> Change --> deselect 'Automatically...'. For all drives, set 'System managed size'. Restart your PC.
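
If you want to check how much RAM and page file are actually in use before and after changing this setting, here is a quick sketch using the third-party psutil package (an assumption here, it is not something this repo installs for you):

# pip install psutil
import psutil

vm = psutil.virtual_memory()
sm = psutil.swap_memory()  # on Windows this reports the page file

print(f"RAM:       {vm.used / 2**30:.1f} / {vm.total / 2**30:.1f} GiB used")
print(f"Page file: {sm.used / 2**30:.1f} / {sm.total / 2**30:.1f} GiB used")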

mondrasovic

mondrasovic commented on Mar 5, 2021

@mondrasovic

Well, I managed to resolve this:
Open "Advanced system settings", go to the Advanced tab, then click Settings under Performance.
Again click the Advanced tab --> Change --> deselect 'Automatically...'. For all drives, set 'System managed size'. Restart your PC.

This works, but only temporarily. Nowadays I am facing a crash after a few hours of training. It usually happens at the beginning of an epoch, while data loading starts.

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Program Files\Python37\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Program Files\Python37\lib\multiprocessing\spawn.py", line 114, in _main

    prepare(preparation_data)
  File "C:\Program Files\Python37\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Program Files\Python37\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Program Files\Python37\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Program Files\Python37\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "E:\projects\siamfc\src\train.py", line 13, in <module>
    import torch
  File "E:\venvs\general\lib\site-packages\torch\__init__.py", line 123, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "E:\venvs\general\lib\site-packages\torch\lib\caffe2_detectron_ops_gpu.dll" or one of its dependencies.
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x0000025934FA8048>
Traceback (most recent call last):
  File "E:\venvs\general\lib\site-packages\torch\utils\data\dataloader.py", line 1324, in __del__
    self._shutdown_workers()
  File "E:\venvs\general\lib\site-packages\torch\utils\data\dataloader.py", line 1291, in _shutdown_workers
    if self._persistent_workers or self._workers_status[worker_id]:
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute '_workers_status'

My environment:

  • Windows 10
  • NVidia CUDA 11.1
  • Python 3.7.9 (tags/v3.7.9:13c94747c7, Aug 17 2020, 18:58:18) [MSC v.1900 64 bit (AMD64)] on win32
  • torch==1.8.0+cu111
  • torchvision==0.9.0+cu111
  • numpy==1.19.5

An interesting and at the same time reproducible crash happened when I launched the Microsoft Teams application. Even MS Teams reported an exception regarding virtual memory; no other app stopped working. Thus, MS Teams and PyTorch training became "mutually exclusive". After I applied the trick mentioned above, the problem remains only on the PyTorch side, and only sometimes. A lot of ambiguous words, I know, but that's how it is.

XuChang2020

XuChang2020 commented on Apr 30, 2021

@XuChang2020

1. Try reducing num_workers to 1 or 0.
2. Try setting batch-size to 2 or 1.
Hope this helps.
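
In plain PyTorch terms (a generic sketch with a stand-in dataset, not the YOLOv3 dataloader itself), that advice amounts to:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; in YOLOv3 the repo builds this for you.
dataset = TensorDataset(torch.zeros(64, 3, 320, 320), torch.zeros(64))

# num_workers=0 loads data in the main process, so no extra worker
# processes have to re-import torch and commit the CUDA DLLs again;
# a small batch_size keeps GPU and host memory pressure down.
loader = DataLoader(dataset, batch_size=1, num_workers=0, shuffle=True)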

PonyPC

PonyPC commented on Jun 9, 2021

@PonyPC

Reducing the number of workers will significantly reduce training speed.

krisstern

krisstern commented on Sep 29, 2021

@krisstern

I was having the same error with YOLOv5, which I fixed by manually changing the number of workers nw to 4 in the "datasets.py" file.

PonyPC

PonyPC commented on Sep 29, 2021

@PonyPC

glenn-jocher

glenn-jocher commented on Sep 29, 2021

@glenn-jocher
Member

@ardeal @krisstern @PonyPC you can set dataloader workers during training, i.e.:

python train.py --workers 16

https://github.com/ultralytics/yolov5/blob/76d301bd21b4de3b0f0d067211da07e6de74b2a0/train.py#L454

It seems like a lot of Windows users are encountering this problem, but as @PonyPC mentioned, reducing workers will generally also result in slower training. Are you encountering this during DDP or single-GPU training?

EDIT: just realized this is YOLOv3 repo and not YOLOv5. I would strongly encourage all users to migrate to YOLOv5, which is much better maintained. It's possible this issue is already resolved there.


glenn-jocher

glenn-jocher commented on Nov 24, 2022

@glenn-jocher
Member

@szan12 i.e. python train.py --workers 4

bit-scientist

bit-scientist commented on Jan 27, 2023

@bit-scientist

I had the same error on Windows 10 today, and following

open "advanced system setting". Go to the advanced tab then click settings related to performance.
Again click on advanced tab--> change --> unselect 'automatically......'. for all the drives, set 'system managed size'. Restart your pc.

didn't help. Then I suddenly remembered that I had installed CUDA 11.7 alongside the already existing CUDA 11.3 and 11.2 versions, and that I had moved the lib and libnvvp path entries up in the system variables at that time. So I decided to install the packages (related to CUDA 11.7) with conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia, reversed the process quoted above (re-selected 'Automatically...'), restarted the PC, and now it's working well.
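
After a reinstall like that, a quick sanity check from Python of which build is actually active (standard PyTorch attributes, nothing specific to this repo):

import torch

print(torch.__version__)          # installed torch build
print(torch.version.cuda)         # CUDA version the build was compiled against
print(torch.cuda.is_available())  # True if the driver/runtime pairing works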

kv1830

kv1830 commented on Apr 5, 2023

@kv1830

@glenn-jocher Unfortunately I don't think that this is something that can be fixed within yolov5.

This is an issue with CUDA and pytorch DLLs. My 'fix' just changes some flags on the DLLs to make them allocate less memory. This likely would be a job for NVidia to fix the flags on their CUDA DLLs (eg, cusolver64_*.dll in CUDA release). Perhaps 'pytorch' could help some as well, since they also package some of these (eg, caffe2_detectron_ops_gpu.dll)... although they use NVidia tools to do this, so the blame probably falls back to NVidia.

Even with my changes to these flags, these DLLs still reserve a whole lot more memory than they actually use. I don't know who is to blame, and since my flag changes got me going I'm not digging further into it.

edit: I went ahead and submitted the info as a 'bug report' to NVIDIA. Whether or not anything happens with it, or any of the appropriate people at NVIDIA ever see it, who knows? But maybe they'll pick it up and do something about it.

Hello, this problem may be solved now!
My environment:
torch 1.13.1+cu117
torchvision 0.14.1+cu117
cuda: 11.8
cudnn: 8.8.1.3_cuda11
or:
cuda: 12.1
cudnn: 8.8.1.3_cuda12

I use YOLOv5-6.2 with --batch-size 16 --workers 16, and the virtual memory it needs is much less than before! (It needed more than 100 GB before.)

Why do I use torch 1.13.1+cu117 with CUDA 11.8?
Actually, I tried torch 2.0+cu118 with CUDA 11.8 (or CUDA 12.1), but something went wrong with AMP, so I switched to torch 1.13.1+cu117 first, and it works (CUDA 11.8 and CUDA 12.1 both work), so I don't want to try CUDA 11.7 any more.

francescobodria

francescobodria commented on May 24, 2023

@francescobodria

I solved it by increasing the page file limit in Windows.

glenn-jocher

glenn-jocher commented on Nov 9, 2023

@glenn-jocher
Member

Thank you for sharing your solution. It's great to hear that increasing the page file limit of Windows helped in resolving the issue. It seems that managing the page file size effectively contributed to stability during the training process. If you encounter any more issues or have further questions, feel free to reach out.

kevinoldman

kevinoldman commented on Dec 13, 2023

@kevinoldman

1. Try reducing num_workers to 1 or 0. 2. Try setting batch-size to 2 or 1. Hope this helps.

It works, but the entire training process became too slow. Is there a better way to solve this? I wasted two days on this.
Thank you.

ardeal

ardeal commented on Dec 13, 2023

@ardeal
Author

1. Try reducing num_workers to 1 or 0. 2. Try setting batch-size to 2 or 1. Hope this helps.

It works, but the entire training process became too slow. Is there a better way to solve this? I wasted two days on this. Thank you.

There is no better solution.
This issue is related to your computer's performance. If you would like to speed up training, you have to improve your hardware: for example, add more memory, use a better GPU, or use a server-grade CPU.

glenn-jocher

glenn-jocher commented on Dec 13, 2023

@glenn-jocher
Member

@ardeal hi there! It seems you've already tried the recommended solutions. As for improving speed, upgrading your hardware such as increasing memory, using a stronger GPU, or leveraging a server CPU may help expedite the training process. If you have further queries or need additional assistance, feel free to ask.

siddtmb

siddtmb commented on Mar 21, 2024

@siddtmb

It is not really related to computer performance, but rather to the fact that:

  1. memory management on PyTorch + Windows is poor;
  2. the Ultralytics dataloader is constantly leaking memory;
  3. Python multiprocessing has its own problems, and there are various things you can do to mitigate them (like storing data in numpy arrays or torch tensors, sketched at the end of this comment) which are not done in the Ultralytics dataloader, hence point 2.

Even on Linux it will slowly eat up all of your memory and any swap partition you have until it grinds training to a halt. The good thing on Linux is that you can just let the OOM killer stop it and resume training (though that is not an option on large datasets, which will still leak memory into oblivion). But on Windows the only solution is to clear pagefile.sys with a hard reboot.
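
As a toy sketch of point 3 (hypothetical dataset, only to illustrate keeping per-item metadata in a numpy array instead of a list of Python objects):

import numpy as np
from torch.utils.data import Dataset

class PathsDataset(Dataset):
    """Toy dataset that keeps its file list in one fixed-dtype numpy array.

    A long Python list of str/dict objects is duplicated into every
    dataloader worker, and (under fork) its refcounts are touched on each
    access, which slowly inflates per-worker memory. A single contiguous
    numpy array sidesteps both problems.
    """

    def __init__(self, paths):
        self.paths = np.array(paths)  # contiguous array, not a Python list

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        path = str(self.paths[i])
        # ... load and return the image/label for `path` here ...
        return path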

glenn-jocher

glenn-jocher commented on Mar 22, 2024

@glenn-jocher
Member

@siddtmb hi! Thanks for your insights. Memory management, particularly in a Windows environment, can indeed introduce challenges. We're continuously working on improving the efficiency of our data loader and overall memory usage within YOLOv3 and appreciate your feedback.

For mitigating memory leaks or high memory usage issues:

  • Ensuring the latest version of PyTorch is used can sometimes alleviate memory management issues, as improvements and bug fixes are regularly released.
  • Experimenting with reducing --workers and --batch-size in your training command may provide immediate relief from memory pressure, though at the expense of training speed.
  • Utilizing torch.utils.data.DataLoader with pin_memory=True and carefully managing tensor operations can help in some situations.
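
A generic sketch of that last point (stand-in dataset; persistent_workers is an extra optional knob, not something prescribed above):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(64, 3, 320, 320), torch.zeros(64))

# pin_memory=True speeds up host-to-GPU copies; persistent_workers=True keeps
# worker processes alive between epochs instead of respawning them (and
# re-importing torch) at every epoch boundary.
loader = DataLoader(dataset, batch_size=8, num_workers=2,
                    pin_memory=True, persistent_workers=True)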

We recognize the importance of efficient memory usage and are committed to making improvements. Contributions and pull requests are always welcome if you have suggestions or optimizations to share with the community. Your feedback is valuable in guiding those efforts. Thank you for bringing this to our attention.
