OP_REQUIRES failed at conv_ops.cc:1106 : Not found: No algorithm worked! #45044

Closed
Harsh188 opened this issue Nov 20, 2020 · 20 comments
Labels
comp:gpu GPU related issues stale This label marks the issue/pr stale - to be closed automatically if no activity stat:awaiting response Status - Awaiting response from author TF 2.4 for issues related to TF 2.4 type:support Support issues

Comments

@Harsh188
Contributor

Harsh188 commented Nov 20, 2020

System information

  • Linux Ubuntu 20.04
  • TensorFlow installed from Docker tensorflow/tensorflow:2.4.0rc1
  • TensorFlow version: 2.4.0rc2
  • Python version: 3.6.9
  • Installed using Docker
  • CUDA/cuDNN version: CUDA 11.1 cuDNN v8
  • GPU model and memory: RTX 3080 FE 10GB

Describe the problem
While training a custom ResNet-50 model, I get the following runtime error:

2020-11-20 12:05:01.826720: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops.cc:1106 : Not found: No algorithm worked!

I don't think the code has any issues: it works fine when training on the CPU.

Any other info / logs

2020-11-20 12:04:55.291380: I tensorflow/core/profiler/lib/profiler_session.cc:136] Profiler session initializing.
2020-11-20 12:04:55.291414: I tensorflow/core/profiler/lib/profiler_session.cc:155] Profiler session started.
2020-11-20 12:04:55.291455: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1365] Profiler found 1 GPUs
2020-11-20 12:04:55.360280: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcupti.so.11.0
2020-11-20 12:04:55.491657: I tensorflow/core/profiler/lib/profiler_session.cc:172] Profiler session tear down.
2020-11-20 12:04:55.491780: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1487] CUPTI activity buffer flushed
2020-11-20 12:04:56.592756: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2020-11-20 12:04:56.610956: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3899970000 Hz
Epoch 1/30
2020-11-20 12:04:58.010569: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-11-20 12:04:58.802284: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2020-11-20 12:04:58.807134: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-11-20 12:05:01.826720: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops.cc:1106 : Not found: No algorithm worked!
Traceback (most recent call last):
  File "custom_resnet.py", line 131, in <module>
    train_model()
  File "custom_resnet.py", line 105, in train_model
    callbacks=[tensorboard_callback]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError:  No algorithm worked!
	 [[node model/conv1/Conv2D (defined at custom_resnet.py:105) ]] [Op:__inference_train_function_8452]

Function call stack:
train_function

2020-11-20 12:05:01.905250: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
	 [[{{node PyFunc}}]]

nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3080    On   | 00000000:2B:00.0  On |                  N/A |
|  0%   43C    P8    25W / 320W |    857MiB /  9995MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

tf.test.is_gpu_available()

WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2020-11-20 12:10:11.234638: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-11-20 12:10:11.235502: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2020-11-20 12:10:11.269174: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.269569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:2b:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.71GHz coreCount: 68 deviceMemorySize: 9.76GiB deviceMemoryBandwidth: 707.88GiB/s
2020-11-20 12:10:11.269584: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-11-20 12:10:11.271142: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-11-20 12:10:11.271167: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2020-11-20 12:10:11.271830: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2020-11-20 12:10:11.271954: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2020-11-20 12:10:11.273538: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2020-11-20 12:10:11.273878: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2020-11-20 12:10:11.273963: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-11-20 12:10:11.274040: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.274432: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.274959: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-11-20 12:10:11.274975: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-11-20 12:10:11.593266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-20 12:10:11.593303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2020-11-20 12:10:11.593309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2020-11-20 12:10:11.593483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.593857: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.594195: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-20 12:10:11.594517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 8743 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3080, pci bus id: 0000:2b:00.0, compute capability: 8.6)
True

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

@Harsh188 Harsh188 added the type:build/install Build and install issues label Nov 20, 2020
@bhack
Contributor

bhack commented Nov 20, 2020

Can you check the CUDA/CUDNN versions in the image/container against this #43718 (comment)?

@Harsh188
Contributor Author

Harsh188 commented Nov 21, 2020

@bhack
The versions from docker are:
CUDA=11.0
CUDNN=8.0.4.30-1
The only difference is CUDA 11.1 vs. 11.0. An earlier comment in the same issue stated that 11.0 worked for their 3090.

@bhack
Contributor

bhack commented Nov 21, 2020

See #44832 (comment)

@bhack
Contributor

bhack commented Nov 21, 2020

@AZdora Can you try to run a ResNet (https://keras.io/api/applications/resnet/) on your 3080 GPU with your working Docker container?

@ravikyram ravikyram added stat:awaiting response Status - Awaiting response from author TF 2.4 for issues related to TF 2.4 labels Nov 23, 2020
@king398

king398 commented Nov 23, 2020

Try adding this right after your imports:
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
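
On a machine with more than one GPU, the same setting can be applied to every device rather than only the first. A minimal sketch, assuming TensorFlow 2.x is installed; `enable_memory_growth` is a hypothetical helper name, not part of TensorFlow, and it has to run before any op initializes the GPU:

```python
def enable_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it all up front."""
    # Imported lazily so the helper can be defined before TensorFlow is loaded.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must happen before the GPU is initialized; afterwards TensorFlow raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return gpus
```

Call `enable_memory_growth()` once, right after the imports and before building the model.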

@Harsh188
Contributor Author

Harsh188 commented Nov 23, 2020

@king398 When doing that I got the following error:

2020-11-23 14:12:00.220322: I tensorflow/core/profiler/lib/profiler_session.cc:136] Profiler session initializing.
2020-11-23 14:12:00.220363: I tensorflow/core/profiler/lib/profiler_session.cc:155] Profiler session started.
2020-11-23 14:12:00.220398: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1365] Profiler found 1 GPUs
2020-11-23 14:12:00.221167: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcupti.so.11.0
2020-11-23 14:12:00.319763: I tensorflow/core/profiler/lib/profiler_session.cc:172] Profiler session tear down.
2020-11-23 14:12:00.319890: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1487] CUPTI activity buffer flushed
2020-11-23 14:12:02.049733: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2020-11-23 14:12:02.069156: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3899740000 Hz
Epoch 1/10
2020-11-23 14:12:04.273239: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-11-23 14:12:04.680420: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2020-11-23 14:12:05.356500: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2020-11-23 14:12:06.524399: W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

2020-11-23 14:12:06.524575: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Unimplemented: /usr/local/cuda-11.0/bin/ptxas ptxas too old. Falling back to the driver to compile.
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2020-11-23 14:12:49.805540: W tensorflow/stream_executor/gpu/asm_compiler.cc:235] Your CUDA software stack is old. We fallback to the NVIDIA driver for some compilation. Update your CUDA version to get the best performance. The ptxas error was: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

@king398

king398 commented Nov 23, 2020

You are using CUDA 11.0, which is not compatible with the RTX 30 series. Try installing CUDA 11.1. You can also try installing through pip instead of Docker. The warning itself says that your CUDA software stack is old and tells you to update it. Also, please tell us your driver version.
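
The `sm_86` warning in the log comes from `ptxas`: the copy shipped with CUDA 11.0 does not know the RTX 30 series architecture, so TensorFlow falls back to the driver for compilation. A rough sketch of that relationship follows. The table and the `ptxas_supports` helper are illustrative assumptions, not something TensorFlow or CUDA exposes, and the table only covers the GPUs mentioned in this thread plus the A100:

```python
# Earliest CUDA toolkit release whose ptxas can target each compute capability.
MIN_TOOLKIT_FOR_ARCH = {
    (7, 5): (10, 0),  # Turing, e.g. RTX 2070
    (8, 0): (11, 0),  # Ampere, A100
    (8, 6): (11, 1),  # Ampere, RTX 30 series
}


def ptxas_supports(compute_capability, toolkit_version):
    """Return True if ptxas from `toolkit_version` knows `compute_capability`."""
    minimum = MIN_TOOLKIT_FOR_ARCH.get(compute_capability)
    if minimum is None:
        raise KeyError(f"compute capability {compute_capability} is not in the table")
    # Tuples compare element by element, so (11, 1) >= (11, 0) is True.
    return toolkit_version >= minimum
```

With these values, `ptxas_supports((8, 6), (11, 0))` is False, which matches the "ptxas too old" message, while an RTX 2070 at `(7, 5)` is covered by the same toolkit.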

@Harsh188
Contributor Author

Harsh188 commented Nov 29, 2020

@king398 I run into a lot of issues when trying to run it with pip. Specifically:

2020-11-29 10:04:29.124995: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

I'm not sure what I'm missing. I've downloaded CUDA 11.1 and CUDNN.
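
One way to narrow down a "Cannot dlopen some GPU libraries" warning is to try loading each library by name and see which ones fail. A minimal sketch, assuming Linux; the library names are copied from the "Successfully opened dynamic library" lines in the TF 2.4 logs earlier in this thread, and `missing_libraries` is a hypothetical helper:

```python
import ctypes

# Names taken from the dso_loader log lines of a working TF 2.4 setup.
TF24_GPU_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def missing_libraries(names=TF24_GPU_LIBS, loader=ctypes.CDLL):
    """Return the subset of `names` that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            loader(name)
        except OSError:
            # dlopen failed: the library is absent or not on the loader search path.
            missing.append(name)
    return missing
```

Running `print(missing_libraries())` in the same shell as the training script shows which names the loader cannot resolve with the current `LD_LIBRARY_PATH`.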

I find that using a docker container is much better since all of the dependencies are packaged by TensorFlow themselves. If there's an issue with the CUDA version that is provided through the docker image from TensorFlow then that should be looked into.

This issue still exists with version rc3.

@ravikyram ravikyram added comp:gpu GPU related issues type:support Support issues and removed type:build/install Build and install issues labels Nov 29, 2020
@ravikyram ravikyram assigned rmothukuru and unassigned ravikyram Nov 29, 2020
@rmothukuru rmothukuru assigned sanjoy and unassigned rmothukuru Dec 4, 2020
@rmothukuru rmothukuru added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 4, 2020
@Sylv-Lej

Sylv-Lej commented Dec 7, 2020

Same issue with nvidia/cuda:11.0-cudnn8-devel-ubuntu18.04 and an RTX 3080.

Using CUDA 11.1 causes:

Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory

Tried with rc0 through rc4.

Edit: Fixed

docker image : nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
tf version : tf-nightly-gpu

Change LD_LIBRARY_PATH so that the CUDA 11.1 libraries are found:

ENV LD_LIBRARY_PATH=/usr/local/cuda-11.1/targets/x86_64-linux/lib

Then create a symlink so that libcusolver.so.10 resolves:

RUN ln -s /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcusolver.so.11 /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcusolver.so.10
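
Outside a Dockerfile, the same workaround could look roughly like this in Python. It is only a sketch: `alias_library` is a hypothetical helper, and pointing `libcusolver.so.10` at the `.so.11` file is a hack that assumes the two versions are compatible enough for what TensorFlow calls.

```python
import os


def alias_library(lib_dir, existing, alias):
    """Create `alias` in `lib_dir` as a symlink to `existing`, if it is not already there."""
    target = os.path.join(lib_dir, existing)
    link = os.path.join(lib_dir, alias)
    if not os.path.exists(target):
        raise FileNotFoundError(target)
    # lexists is True even for a dangling symlink, so an existing alias is never overwritten.
    if not os.path.lexists(link):
        os.symlink(target, link)
    return link
```

For the paths in this comment, the call would be `alias_library("/usr/local/cuda-11.1/targets/x86_64-linux/lib", "libcusolver.so.11", "libcusolver.so.10")`, run with enough permissions to write to that directory.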

If you get a cuBLAS error, you can try this:

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)

@Harsh188
Contributor Author

Harsh188 commented Dec 7, 2020

I've found a temporary solution by using software provided by lambda stack. It works on ubuntu 20.04 for all RTX 30 series GPUs.

@sanjoy
Contributor

sanjoy commented Dec 24, 2020

TF 2.4 is built & tested against CUDA 11.0, not 11.1.
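
To see which CUDA and cuDNN versions an installed TensorFlow binary was built against, the build info can be queried at runtime. A small sketch, assuming a TensorFlow release that provides `tf.sysconfig.get_build_info()`; `built_against` is a hypothetical helper name:

```python
def built_against():
    """Return the (CUDA, cuDNN) versions recorded in the installed TensorFlow build."""
    # Imported lazily so the helper can be defined without loading TensorFlow.
    import tensorflow as tf

    info = tf.sysconfig.get_build_info()
    # CPU-only builds carry no CUDA keys, hence .get() instead of indexing.
    return info.get("cuda_version"), info.get("cudnn_version")
```

Comparing this pair with the toolkit actually present in the container or on the host is a quick way to spot a mismatch like the 11.0 vs. 11.1 one discussed here.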

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Dec 26, 2020
@aolivieri

aolivieri commented Jan 21, 2021

I have the exact same problem trying to make TF work with my RTX 3070. CUDA 11.1 + CUDNN 8.0.5.39 + TF2.4.0

Note: I had to use the symlink trick so TF could find libcusolver.so.10, which is not available in the CUDA 11.1 package.

@napulen

napulen commented Jan 28, 2021

I experienced this issue on an MSI GL65 with an RTX2070 on Ubuntu 20.04.

Dynamic libraries are the following:

In [1]: import tensorflow                                                                                                                                                                                          
2021-01-28 16:05:15.891481: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0

In [2]: tensorflow.__version__                                                                                                                                                                                     
Out[2]: '2.4.0'

In [3]: tensorflow.config.experimental.list_physical_devices('GPU')                                                                                                                                                
2021-01-28 16:06:40.579904: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-28 16:06:40.588165: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-28 16:06:40.619240: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-28 16:06:40.619800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.455GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 327.88GiB/s
2021-01-28 16:06:40.619823: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-28 16:06:40.627330: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-28 16:06:40.627382: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-28 16:06:40.631550: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-28 16:06:40.633606: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-28 16:06:40.642000: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-28 16:06:40.644472: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-28 16:06:40.645649: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-28 16:06:40.645749: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-28 16:06:40.646153: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-28 16:06:40.646490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
Out[3]: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Adding the lines indicated by @king398 solved my issue.

Try adding this right after your imports:
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

@asferrer

asferrer commented Apr 6, 2021

Adding the lines indicated by @king398 solved my issue too, on my GL65 with an RTX 2070 on Ubuntu 20.04.

@vladGriguta

If the error persists after setting the GPU memory growth configuration, as indicated by @king398, you might want to try dropping the batch size during training.
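
A batch-size search can be automated by halving the batch size whenever training fails. This is a sketch under assumptions: `run_with_smaller_batches` and `train_fn` are hypothetical names, `train_fn(batch_size)` stands for whatever builds and fits the model, and retrying inside the same process may not recover after a GPU failure, in which case each attempt should run in a fresh process.

```python
def run_with_smaller_batches(train_fn, batch_size, min_batch_size=1, retry_on=(Exception,)):
    """Call train_fn(batch_size), halving batch_size after each failure listed in retry_on."""
    while True:
        try:
            return batch_size, train_fn(batch_size)
        except retry_on:
            # Give up once halving again would go below the allowed minimum.
            if batch_size // 2 < min_batch_size:
                raise
            batch_size //= 2
```

The function returns the batch size that finally worked together with the result of `train_fn`, so the value can be reused in later runs.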

@aolivieri

One additional hint since it took me some time to figure it out. The set_memory_growth() didn't take effect in my setup until I added the os.environ['CUDA_VISIBLE_DEVICES']="0" (note I have only one GPU).
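
A sketch of that ordering: the environment variables are set first, and TensorFlow is imported only afterwards. `prepare_gpu_environment` is a hypothetical helper. `TF_FORCE_GPU_ALLOW_GROWTH` is an addition that was not mentioned in this thread; it is the environment-variable form of the memory growth option described in the TensorFlow GPU guide.

```python
import os


def prepare_gpu_environment(device_ids="0", environ=None):
    """Set GPU-related environment variables; call this before importing TensorFlow."""
    if environ is None:
        environ = os.environ
    # Restrict the process to the listed GPU(s).
    environ["CUDA_VISIBLE_DEVICES"] = device_ids
    # Assumption: the environment-variable counterpart of set_memory_growth().
    environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return environ
```

In a training script this would be the first call, followed by `import tensorflow as tf` and then the `set_memory_growth()` lines.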

BTW, this still looks like a workaround to me and ideally we would have to fix this (I didn't face this problem with the older versions of CUDA and cuDNN compatible with the RTX20xx series).

@Saduf2019
Contributor

@Harsh188
Could you please let us know if this is still an issue in the latest stable TF v2.6.0? Thank you!

@Saduf2019 Saduf2019 added the stat:awaiting response Status - Awaiting response from author label Nov 2, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Nov 9, 2021
@Harsh188
Contributor Author

@Saduf2019 There are no issues with v2.6.0.
