Closed
Labels
module: cuda (Related to torch.cuda, and CUDA support in general)
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
Description
🐛 Bug
RuntimeError: CUDA error: no kernel image is available for execution on the device when using PyTorch 1.7 on Linux with RTX 3090 + Ubuntu 20 + GPU driver 455.45 + CUDA 11.0
I am an experienced user of pytorch-gpu. I recently purchased an RTX 3090 server, but this bug with PyTorch 1.7 and the RTX 3090 is driving me mad. I ran a lot of experiments to figure it out, but I failed. You can reproduce the bug as follows.
To Reproduce
Steps to reproduce the behavior:
- First, I installed the RTX 3090 + Ubuntu 20 + GPU driver 455 + CUDA 11.1 (both officially recommended by NVIDIA for the RTX 3090) on my host machine, which has an RTX 3090 GPU. When I run
nvidia-smi
the GPU information is shown correctly.
- Second, I built a Docker container from an official nvidia-docker image on Docker Hub. You can pull it with
docker pull nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04
and then install PyTorch 1.7. In this container I can use nvidia-smi to check the GPU's information, and torch.cuda.is_available() returns True. I can also create a tensor on the GPU with torch.zeros(1).cuda() and concatenate two CUDA tensors with torch.cat((a.cuda(), b.cuda())).
- But when I run my deep learning Python script, which runs correctly on RTX 2080 + Ubuntu 20 + GPU driver 455 + CUDA 10.2 + cuDNN 8 + PyTorch 1.7, the bug occurs. The error is RuntimeError: CUDA error: no kernel image is available for execution on the device.
- I am pretty sure that my Python script has no bugs, because I have run the same script on many different servers and environments (PyTorch + RTX 2080/GTX 1080/Titan X/K80/Tesla V100), and it has never failed.
- I have also tried different official nvidia-docker images with CUDA 11.1 + PyTorch 1.7 or CUDA 10.1 + PyTorch 1.7, but it doesn't help:
1. With RTX 3090 + Ubuntu 20 + GPU driver 455 + CUDA 11.0 + cuDNN 8 + PyTorch 1.7 + Python 3.7, the error is RuntimeError: CUDA error: no kernel image is available for execution on the device
2. With RTX 3090 + Ubuntu 20 + GPU driver 455 + CUDA 11.1 + cuDNN 8 + PyTorch 1.7 + Python 3.7, the error is RuntimeError: CUDA error: no kernel image is available for execution on the device
3. With RTX 3090 + Ubuntu 20 + GPU driver 455 + CUDA 10.2 + cuDNN 8 + PyTorch 1.7 + Python 3.7, the error is RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:116
4. With RTX 3090 + Ubuntu 20 + GPU driver 455 + CUDA 11.0 + cuDNN 8 + PyTorch 1.7 + Python 3.7, it also shows: RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
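The sm_86 message in the last item can be checked directly from Python. This is a minimal sketch, assuming `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()` are available in the installed build. The helper `arch_supported` is hypothetical, and its rule (same major version, compiled minor not higher than the device's) is a simplification of CUDA binary compatibility, not PyTorch's own check.

```python
def arch_supported(capability, arch_list):
    """Return True if some compiled sm_XY arch can plausibly run on the device.

    Assumption: a binary built for sm_XY runs on a device with the same major
    version X and a minor version >= Y. PTX (compute_XY) entries are ignored.
    """
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue
        digits = arch[len("sm_"):]
        if not digits.isdigit() or len(digits) < 2:
            continue
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if arch_major == major and arch_minor <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)  # (major, minor) of GPU 0
        archs = torch.cuda.get_arch_list()         # archs this binary was built for
        print("device capability:", cap)
        print("compiled archs:", archs)
        print("supported (by the simplified rule):", arch_supported(cap, archs))
```

If the compiled list contains no entry with the device's major version, the installed wheel was probably built before that GPU generation was supported.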
Expected behavior
I want to run a PyTorch script on RTX 3090 with ubuntu.
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
- PyTorch Version (e.g., 1.0): 1.7
- OS (e.g., Linux): Linux, Ubuntu 20
- How you installed PyTorch (conda, pip, source): I have tried:
- pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
- pip install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html
- Build command you used (if compiling from source):
- Python version: 3.7
- CUDA/cuDNN version: CUDA 11.0 + cuDNN 8
- GPU models and configuration: RTX 3090
- Any other relevant information:
Additional context
cc @ngimel
Dingseewhole commented on Dec 10, 2020
The bug picture is here
ngimel commented on Dec 10, 2020
The error message on your screen is quite clear: your PyTorch installation does not support the sm_86 compute capability. That said, torch==1.7.0+cu110 supports 30xx GPUs, so perhaps your PyTorch version gets downgraded when you install torchvision or torchaudio? When you install a binary build of PyTorch, your CUDA and cuDNN versions don't matter, because PyTorch uses its own.
cc @malfet.
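One way to check for the downgrade suggested above is to inspect the version string of the torch that Python actually imports. This is a rough sketch; `split_torch_version` is a hypothetical helper, and it assumes the wheel carries a local tag such as `+cu110`, as the install commands in this thread do.

```python
def split_torch_version(version):
    """Split '1.7.0+cu110' into ('1.7.0', 'cu110'); the tag is '' if absent."""
    base, _, tag = version.partition("+")
    return base, tag


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        base, tag = split_torch_version(torch.__version__)
        print("torch release:", base)
        print("build tag:", tag or "(none)")
        # CUDA version the binary was built with; None for CPU-only builds.
        print("CUDA used to build torch:", torch.version.cuda)
```

If the release or tag printed here differs from what was requested at install time, a later package install may have replaced the original wheel.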
malfet commented on Dec 11, 2020
@Dingseewhole can you please see if the problem persists if you upgrade to PyTorch-1.7.1, which was released today?
Can you please re-run your training script with the
CUDA_LAUNCH_BLOCKING=1
environment variable set, to see what operation actually caused the exception?
And can you let me know if the
python -c "import torch;print(torch.max(torch.rand((30,30),device='cuda')))"
command works on your RTX-3090 system?
Dingseewhole commented on Dec 11, 2020
Thank you for your help!

I used the official PyTorch pip install command
pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
to build my PyTorch environment. After that, I didn't change my torch, torchvision, or torchaudio versions, so I don't think my PyTorch version was downgraded.
Here is my
pip list
result; you can see the versions of all the packages my Python script uses.
malfet commented on Dec 11, 2020
@Dingseewhole, try upgrading to 1.7.1 by using
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
Dingseewhole commented on Dec 11, 2020
Thank you for your help. When I upgraded PyTorch from 1.7.0 to 1.7.1, the bug disappeared.
Thanks!
gohguodong commented on Dec 31, 2020
Hi, I am experiencing a similar issue with the RTX 3080. I used the command below to install PyTorch:
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html
Below are the details of my environment:
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] numpydoc==1.1.0
[pip3] torch==1.7.1+cu110
[pip3] torchvision==0.8.2+cu110
[conda] blas 1.0 mkl
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.2.0 py38h23d657b_0
[conda] mkl_random 1.1.1 py38h0573a6f_0
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] torch 1.7.1+cu110 pypi_0 pypi
[conda] torchvision 0.8.2+cu110 pypi_0 pypi
The command python -c "import torch;print(torch.max(torch.rand((30,30),device='cuda')))" works on my device.
However, the no kernel image error still persists in PyTorch 1.7.1. FYI, I am using WSL2. Any idea what went wrong?
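When the one-line check passes but a real script still fails, it can help to try several operation types separately. This is a minimal sketch under stated assumptions: `run_probes` and `cuda_probes` are hypothetical helpers, and the chosen operations (max, cat, matmul, addmm) are only illustrative. Each probe ends with `.item()`, which copies a value back to the host and so forces the GPU work to finish inside the `try` block.

```python
def run_probes(probes):
    """Run each named callable and map its name to None or the error text."""
    results = {}
    for name, fn in probes.items():
        try:
            fn()
            results[name] = None
        except Exception as exc:  # e.g. RuntimeError raised for a CUDA failure
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


def cuda_probes():
    """Build a few small GPU operations; the selection here is illustrative."""
    import torch

    a = torch.rand(30, 30, device="cuda")
    b = torch.rand(30, 30, device="cuda")
    bias = torch.rand(30, device="cuda")
    return {
        "max": lambda: torch.max(a).item(),
        "cat": lambda: torch.cat((a, b)).sum().item(),
        "matmul": lambda: (a @ b).sum().item(),
        "addmm": lambda: torch.addmm(bias, a, b).sum().item(),
    }


if __name__ == "__main__":
    try:
        for name, error in run_probes(cuda_probes()).items():
            print(f"{name}: {'ok' if error is None else error}")
    except Exception as exc:  # torch missing, or no usable GPU
        print("could not build probes:", exc)
```

Some CUDA errors leave the process in a bad state, so failures reported after the first one may be follow-on effects rather than separate problems.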
YUyttendaele commented on Jan 3, 2021
@gohguodong I'm not sure, but your nvidia driver may be incompatible with CUDA 11.0. Check https://docs.nvidia.com/deploy/cuda-compatibility/index.html for more info.
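A driver check like the one suggested here comes down to comparing dotted version strings numerically. This is a small sketch; `parse_driver_version` and `driver_at_least` are hypothetical helpers, and the threshold in the example is a placeholder, not NVIDIA's documented minimum. The real minimum should be taken from the compatibility page linked above.

```python
def parse_driver_version(text):
    """Turn a driver string such as '460.32.03' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_at_least(installed, required):
    """Compare two dotted driver versions numerically, component by component."""
    return parse_driver_version(installed) >= parse_driver_version(required)


if __name__ == "__main__":
    # Placeholder threshold for illustration only; look up the real minimum
    # for your CUDA release in NVIDIA's compatibility table.
    required = "450.00"
    for installed in ("455.45", "460.32.03", "460.39"):
        print(installed, "->", driver_at_least(installed, required))
```

Comparing tuples of integers avoids the pitfall of string comparison, where "460.9" would sort after "460.32".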
maltevb commented on Jan 29, 2021
EDIT
Reinstalling the NVIDIA driver, CUDA and torch as mentioned by @malfet did the trick for me. After that, the nvidia-smi command worked and torch detected my GPUs again.
Hey guys,
I am also facing the same problem with my RTX 3090.
The specs are:
torch = 1.7.1
cuda = 11.2
nvidia driver = 460.32.03
torchvision = 0.8.2
Does anybody have an idea how to fix it? I also tried torch = 1.7.0, but the error still occurs.
I would appreciate it if you shared your experience.
See the full error message below:
Traceback (most recent call last):
  File "/home/user/20210129_instance_nn_tuned/years/code/solver.py", line 101, in train
    y_predicted = model(x_train).flatten() # flatten to (N,) otherwise it has shape (N,1) and loss function throws a warning
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/20210129_instance_nn_tuned/years/code/nn_architecture.py", line 53, in forward
    out = self.model(x)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: no kernel image is available for execution on the device
hszhoushen commented on Feb 2, 2021
Hi, did you solve the problem? I have a similar configuration and the same problem.
RTX3090
torch = 1.7.1
cuda = 11.1
nvidia driver = 460.32.03
torchvision = 0.8.2
linhduongtuan commented on Feb 28, 2021
I am struggling with the same issue. My machine with an RTX 3090 runs Ubuntu 20.04, CUDA = 11.2, NVIDIA driver = 460.39, and PyTorch installed with "pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html" (I also tried the nightly version). Error message:
"RuntimeError: CUDA error: no kernel image is available for execution on the device"
One important thing: before my machine was updated automatically, everything was fine. After the update, it broke.

Furthermore, when I call nvidia-smi, the output also indicates an error at "Volatile Uncorr. ECC ...."; you can see my attached image. And even when it gets past the first error above, GPU-Util is only about 0-1%.