RuntimeError: CUDA error: no kernel image is available for execution on the driver, when use pytorch 1.7 on linux with RTX 3090 #49161

@Dingseewhole

🐛 Bug

RuntimeError: CUDA error: no kernel image is available for execution on the device, raised when using PyTorch 1.7 on Linux with RTX 3090 + Ubuntu 20 + GPU driver 455.45 + CUDA 11.0.
I am an experienced user of pytorch-gpu. I recently purchased an RTX 3090 server, but this bug with PyTorch 1.7 and the RTX 3090 is driving me mad. I have tried many experiments to figure it out, without success. You can reproduce the bug as follows.

To Reproduce

Steps to reproduce the behavior:

  1. First, I installed Ubuntu 20 + GPU driver 455 + CUDA 11.1 (both recommended by NVIDIA for the RTX 3090) on my host machine, which has an RTX 3090 GPU. nvidia-smi shows the GPU information correctly.
  2. Second, I built a Docker container from an official nvidia-docker image on Docker Hub (you can pull it with docker pull nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04) and installed PyTorch 1.7 in it. In this container, nvidia-smi shows the GPU's information and torch.cuda.is_available() returns True. I can also create a tensor on the GPU with torch.zeros(1).cuda() and concatenate two CUDA tensors with torch.cat((a.cuda(), b.cuda())).
  3. But when I run my deep learning Python script, which runs correctly on RTX 2080 + Ubuntu 20 + GPU driver 455 + CUDA 10.2 + cuDNN 8 + PyTorch 1.7, the bug occurs. The error is RuntimeError: CUDA error: no kernel image is available for execution on the device.
  4. I am pretty sure that my Python script has no bugs, because I have run the same script on many different servers and environments (PyTorch + RTX 2080/GTX 1080/Titan X/K80/Tesla V100) and it has never failed.
  5. I have also tried different official nvidia-docker images with CUDA 11.1 + PyTorch 1.7 or CUDA 10.1 + PyTorch 1.7, but that did not help. The results for each configuration were:
      1. RTX 3090 + Ubuntu 20 + GPU driver 455 + CUDA 11.0 + cuDNN 8 + PyTorch 1.7 + Python 3.7: the error is RuntimeError: CUDA error: no kernel image is available for execution on the device
      2. RTX 3090 + Ubuntu 20 + GPU driver 455 + CUDA 11.1 + cuDNN 8 + PyTorch 1.7 + Python 3.7: the error is RuntimeError: CUDA error: no kernel image is available for execution on the device
      3. RTX 3090 + Ubuntu 20 + GPU driver 455 + CUDA 10.2 + cuDNN 8 + PyTorch 1.7 + Python 3.7: the error is RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:116
      4. RTX 3090 + Ubuntu 20 + GPU driver 455 + CUDA 11.0 + cuDNN 8 + PyTorch 1.7 + Python 3.7: it also warns that the RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.

image

Expected behavior

I want to run a PyTorch script on an RTX 3090 with Ubuntu.

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version (e.g., 1.0): 1.7
  • OS (e.g., Linux): Linux (Ubuntu 20)
  • How you installed PyTorch (conda, pip, source): I have tried:
  1. pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
  2. pip install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu110/torch_nightly.html
  • Build command you used (if compiling from source):
  • Python version: 3.7
  • CUDA/cuDNN version: CUDA 11.0 + cuDNN 8
  • GPU models and configuration: RTX 3090
  • Any other relevant information:

Additional context

cc @ngimel

Dingseewhole (Author) commented on Dec 10, 2020

Here is a screenshot of the error:
image

Labels added on Dec 10, 2020: module: cuda, triaged

ngimel (Collaborator) commented on Dec 10, 2020

The error message on your screen is quite clear: your PyTorch installation does not support the sm_86 compute capability. That said, torch==1.7.0+cu110 supports 30xx GPUs, so perhaps your PyTorch version gets downgraded when you install torchvision or torchaudio? When you install a binary build of PyTorch, your local CUDA and cuDNN versions don't matter, because PyTorch uses its own.
cc @malfet.
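
One way to check whether an installed build covers the card is to compare the device's compute capability with the architectures the build was compiled for. The sketch below is a simplified illustration, not PyTorch's own check. The helper name build_covers_device is hypothetical, and the example architecture lists in the usage note are made up for illustration. It assumes the usual CUDA compatibility rule: an sm_XY binary runs on devices with the same major version and an equal or higher minor version, and compute_XY PTX can be JIT-compiled for any equal or newer capability. torch.cuda.get_device_capability() and torch.cuda.get_arch_list() are only called when torch and a GPU are actually present.

```python
from typing import Iterable, Tuple


def build_covers_device(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if any compiled architecture can serve the given device.

    capability: (major, minor) of the GPU, e.g. (8, 6) for sm_86.
    arch_list:  strings such as "sm_80" or "compute_75".
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit():
            continue  # skip anything that does not look like sm_XY / compute_XY
        arch_major, arch_minor = divmod(int(digits), 10)
        # Binary (cubin) code: same major version, minor not newer than the device.
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        # PTX: forward compatible with any equal or newer capability.
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print(torch.__version__, cap, archs, build_covers_device(cap, archs))
        else:
            print("CUDA is not available in this environment")
```

For example, build_covers_device((8, 6), ["sm_70", "sm_75"]) returns False, while adding "sm_80" to the list makes it return True. If the check fails for the installed wheel, that would be consistent with the "not compatible with the current PyTorch installation" warning.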

malfet (Contributor) commented on Dec 11, 2020

@Dingseewhole can you please check whether the problem persists after you upgrade to PyTorch 1.7.1, which was released today?
Can you also re-run your training script with the CUDA_LAUNCH_BLOCKING=1 environment variable set, to see which operation actually caused the exception?
And can you let me know whether the python -c "import torch;print(torch.max(torch.rand((30,30),device='cuda')))" command works on your RTX 3090 system?
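
A minimal sketch of the second suggestion, for anyone who prefers to drive it from Python rather than the shell: the helper below runs a command with CUDA_LAUNCH_BLOCKING=1 added to its environment, so kernel launches are synchronous and the traceback should point at the operation that failed. The function name run_with_launch_blocking and the script name train.py in the usage note are hypothetical.

```python
import os
import subprocess
import sys
from typing import Sequence


def run_with_launch_blocking(argv: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a command with CUDA_LAUNCH_BLOCKING=1 set in its environment."""
    # Copy the current environment and add the variable for the child only.
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return subprocess.run(list(argv), env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Sanity check: the child process should see the variable.
    result = run_with_launch_blocking(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
    )
    print(result.stdout.strip())
```

Usage would look like run_with_launch_blocking([sys.executable, "train.py"]), after which result.stderr holds the traceback. The variable has to be set before the process initializes CUDA, which is why it is passed through the child's environment here.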

Dingseewhole (Author) commented on Dec 11, 2020

Thank you for your help!
I used the official PyTorch pip install command, pip install torch==1.7.0+cu110 torchvision==0.8.1+cu110 torchaudio===0.7.0 -f https://download.pytorch.org/whl/torch_stable.html, to build my PyTorch environment. After that, I did not change my torch, torchvision, or torchaudio versions, so I don't think my PyTorch version was downgraded.
Here is my pip list output, showing the versions of all the packages my Python script uses:
image
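
As a rough way to confirm which build actually ended up installed, the version string reported by torch.__version__ can be split into its release number and its local tag (such as cu110). This is only a sketch: parse_torch_version and is_expected_build are hypothetical helper names, and the defaults (release 1.7.1 or newer, tag cu110) are simply the values discussed in this thread, not a general requirement.

```python
import re
from typing import Optional, Tuple

# Matches "1.7.1+cu110", "1.7.0", or "1.8.0.dev20201210+cu110".
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)[^+]*(?:\+(.+))?$")


def parse_torch_version(version: str) -> Tuple[Tuple[int, int, int], Optional[str]]:
    """Split a version string into ((major, minor, patch), local_tag)."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = (int(match[1]), int(match[2]), int(match[3]))
    return release, match[4]


def is_expected_build(
    version: str,
    min_release: Tuple[int, int, int] = (1, 7, 1),  # assumption: value from this thread
    cuda_tag: str = "cu110",                        # assumption: value from this thread
) -> bool:
    """Return True if the release is at least min_release and the tag matches."""
    release, tag = parse_torch_version(version)
    return release >= min_release and tag == cuda_tag


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        print(torch.__version__, is_expected_build(torch.__version__))
```

Running this in the same environment as the failing script is the point: a different interpreter or virtual environment may hold a different torch than the one pip list shows.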

malfet (Contributor) commented on Dec 11, 2020

@Dingseewhole, try upgrading to 1.7.1 by using
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Dingseewhole (Author) commented on Dec 11, 2020

Thank you for your help. After I upgraded PyTorch from 1.7.0 to 1.7.1, the bug disappeared.
Thanks!

gohguodong commented on Dec 31, 2020

Hi, I am experiencing a similar issue with the RTX 3080. I used the command below to install PyTorch:
pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html

Below are the details of my environment:
PyTorch version: 1.7.1+cu110
Is debug build: False
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] numpydoc==1.1.0
[pip3] torch==1.7.1+cu110
[pip3] torchvision==0.8.2+cu110
[conda] blas 1.0 mkl
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.2.0 py38h23d657b_0
[conda] mkl_random 1.1.1 py38h0573a6f_0
[conda] numpy 1.19.2 py38h54aff64_0
[conda] numpy-base 1.19.2 py38hfa32c7d_0
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] torch 1.7.1+cu110 pypi_0 pypi
[conda] torchvision 0.8.2+cu110 pypi_0 pypi

The command python -c "import torch;print(torch.max(torch.rand((30,30),device='cuda')))" works on my device.

However, the "no kernel image" error still persists with PyTorch 1.7.1. FYI, I am using WSL2. Any idea what went wrong?

YUyttendaele commented on Jan 3, 2021

@gohguodong I'm not sure, but your nvidia driver may be incompatible with CUDA 11.0. Check https://docs.nvidia.com/deploy/cuda-compatibility/index.html for more info.
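
A small sketch of how such a driver check could be scripted: compare the installed driver version (as printed by nvidia-smi) with the minimum listed for the CUDA toolkit in NVIDIA's compatibility table. The helper names are hypothetical, and the minimum version used in the example is a placeholder, not a value taken from NVIDIA's documentation. Look up the real number in the linked table.

```python
from typing import Tuple


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Turn a driver string such as "460.32.03" into (460, 32, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_at_least(installed: str, required: str) -> bool:
    """Return True if the installed driver is at least the required version."""
    a = parse_driver_version(installed)
    b = parse_driver_version(required)
    # Pad the shorter tuple with zeros so "455.45" compares like "455.45.0".
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    # Placeholder minimum: replace with the value from NVIDIA's table.
    placeholder_minimum = "450.00.00"
    print(driver_at_least("460.32.03", placeholder_minimum))
```

Comparing numeric tuples rather than raw strings avoids mistakes such as "455.9" sorting after "455.45".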

maltevb commented on Jan 29, 2021

EDIT

Reinstalling the NVIDIA driver, CUDA, and torch as suggested by @malfet did the trick for me. After that, the nvidia-smi command worked and torch detected my GPUs again.

Hey guys,
I am also facing the same problem with my RTX 3090.
The specs are:

torch = 1.7.1
cuda = 11.2
nvidia driver = 460.32.03
torchvision = 0.8.2

Does anybody have an idea how to fix it? I also tried torch = 1.7.0, but the error still occurs.
I would appreciate it if you shared your experience.

See the full error message below:
Traceback (most recent call last):
  File "/home/user/20210129_instance_nn_tuned/years/code/solver.py", line 101, in train
    y_predicted = model(x_train).flatten() # flatten to (N,) otherwise it has shape (N,1) and loss function throws a warning
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/20210129_instance_nn_tuned/years/code/nn_architecture.py", line 53, in forward
    out = self.model(x)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/user/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
    ret = torch.addmm(bias, input, weight.t())

RuntimeError: CUDA error: no kernel image is available for execution on the device
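
Since this traceback ends in torch.addmm, while the torch.max/torch.rand one-liner from earlier in the thread was reported to work on another affected machine, a smoke test that exercises a matrix multiply directly may narrow things down. The sketch below is only an illustration under that assumption. The function name matmul_smoke_test and its status strings are hypothetical, and it returns a status instead of raising so it can also be run on machines without torch or without a GPU.

```python
def matmul_smoke_test(size: int = 64) -> str:
    """Try one addmm on the GPU and report what happened as a short string."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    if not torch.cuda.is_available():
        return "cuda-unavailable"
    try:
        bias = torch.rand(size, device="cuda")
        x = torch.rand(size, size, device="cuda")
        weight = torch.rand(size, size, device="cuda")
        # Same call shape as the last frame of the traceback (F.linear).
        torch.addmm(bias, x, weight.t())
        # Wait for the kernel so an asynchronous error surfaces here.
        torch.cuda.synchronize()
        return "ok"
    except RuntimeError as exc:
        return f"failed: {exc}"


if __name__ == "__main__":
    print(matmul_smoke_test())
```

If this reports a failure while the elementwise one-liner succeeds, that would suggest the problem is specific to the matrix-multiply path rather than to CUDA initialization as a whole.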

hszhoushen commented on Feb 2, 2021

@maltevb Hi, did you solve the problem? I have a similar configuration and the same problem:
RTX3090
torch = 1.7.1
cuda = 11.1
nvidia driver = 460.32.03
torchvision = 0.8.2

linhduongtuan commented on Feb 28, 2021

I am struggling with the same issue. My machine has an RTX 3090 and runs Ubuntu 20.04 with CUDA 11.2 and NVIDIA driver 460.39. I installed PyTorch with "pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio===0.7.2 -f https://download.pytorch.org/whl/torch_stable.html" (I also tried the nightly version). The error message is:
"RuntimeError: CUDA error: no kernel image is available for execution on the device"

One important thing: before my machine was updated automatically, everything was fine. After the update, it broke.
Furthermore, when I run nvidia-smi, its output also indicates an error under "Volatile Uncorr. ECC ....", as you can see in my attached image. And even when it gets past the first error above, GPU-Util is only around 0-1%!
Screenshot from 2021-02-28 16-49-08
