slim.separable_conv2d is too slow #12132
I am also wondering why tf.nn.separable_conv2d is so slow compared to tf.nn.conv2d. I would expect the separable conv to be a lot faster when the channel multiplier is far smaller than the number of output channels, but in reality it is only slightly faster. Why is this? Is it because conv2d uses cuDNN internally whereas separable_conv2d does not, or is there another reason? |
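As a back-of-the-envelope check of that expectation, here is a rough multiply-accumulate count. The shapes are hypothetical (they happen to match the benchmark posted further down in this thread), and this is only a sketch of the arithmetic, not code from the thread:

# Theoretical multiply-accumulate counts for one layer (hypothetical shapes).
H = W = 32            # spatial output size
C_in, C_out = 32, 64  # input / output channels
K = 15                # kernel size
M = 8                 # channel multiplier (depthwise filters per input channel)

standard  = H * W * C_in * C_out * K * K                          # regular conv2d
separable = H * W * C_in * M * K * K + H * W * C_in * M * C_out   # depthwise + 1x1 pointwise

print("standard / separable MAC ratio: %.1fx" % (float(standard) / separable))
# ~6x fewer multiply-accumulates with these shapes, so one would hope for a
# comparable wall-clock speed-up if both kernels were equally well optimized.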
@reedwm, the code below shows that the separable convolution (depthwise followed by pointwise) is pretty much useless. It is much more time-efficient to compute the effective filters and use the normal convolution function (tf.nn.conv2d) instead of the separable convolution. With the settings below, one would expect the separable convolution to be much faster, since it only needs to compute 32x8 heavy convolutions (15x15 filter size) and 32x8x64 light convolutions (1x1 filter size), whereas the normal convolution needs to compute 32x64 heavy convolutions (15x15 filter size).

import tensorflow as tf
import numpy as np
import time
# Define a scenario
batch_size = 64
channels = 32
image_size = 32
feature_maps = 64
filter_size = 15
depthwise_filters = 8
# Dummy images
images = tf.random_normal(shape=[batch_size, channels, image_size, image_size],
dtype=tf.float32)
# Filter definitions
basis_filters = tf.random_normal(shape=[filter_size, filter_size, channels, depthwise_filters],
dtype=tf.float32)
coeffs = tf.random_normal(shape=[channels, depthwise_filters, feature_maps],
dtype=tf.float32)
# Normal method
effective_filters = tf.einsum('hwcm,cmn->hwcn', basis_filters, coeffs)
normal = tf.nn.conv2d(images,
effective_filters,
strides=[1, 1, 1, 1],
padding="SAME",
use_cudnn_on_gpu=True,
data_format="NCHW")
# Separable method
depthwise = tf.nn.depthwise_conv2d_native(images,
basis_filters,
strides=[1, 1, 1, 1],
padding="SAME",
data_format="NCHW")
coeffs = tf.reshape(coeffs, [1, 1, channels*depthwise_filters, feature_maps])
separable = tf.nn.conv2d(depthwise,
coeffs,
strides=[1, 1, 1, 1],
padding="VALID",
use_cudnn_on_gpu=True,
data_format="NCHW")
with tf.Session() as sess:
    # Assert equality of the different methods
    norm, sep = sess.run([normal, separable])
    np.testing.assert_almost_equal(norm, sep, decimal=3)
    repeats = 100
    # Benchmark normal method
    start = time.time()
    for _ in xrange(repeats):
        _ = sess.run(normal)
    end = time.time()
    d1 = int((end - start) / repeats * 1000)
    # Benchmark separable method
    start = time.time()
    for _ in xrange(repeats):
        _ = sess.run(separable)
    end = time.time()
    d2 = int((end - start) / repeats * 1000)
    # Print results
    print("Normal method: {}ms \t Separable method: {}ms".format(d1, d2))

Evaluated on an Nvidia M60 with tensorflow-v1.1.0, this code outputs:
My guess is that the tf.nn.depthwise_conv2d function is simply much slower than tf.nn.conv2d? |
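One way to test that guess is to time the depthwise stage on its own. The following is a minimal sketch reusing the same hypothetical shapes as the benchmark above; it is not the original poster's code and assumes a GPU (NCHW depthwise conv is GPU-only in TF 1.x):

import time
import tensorflow as tf

# Time only the depthwise stage of the benchmark above, to see whether it is the bottleneck.
images = tf.random_normal([64, 32, 32, 32], dtype=tf.float32)     # NCHW
dw_filters = tf.random_normal([15, 15, 32, 8], dtype=tf.float32)  # [H, W, in_channels, multiplier]

depthwise_only = tf.nn.depthwise_conv2d_native(
    images, dw_filters, strides=[1, 1, 1, 1], padding="SAME", data_format="NCHW")

with tf.Session() as sess:
    sess.run(depthwise_only)  # warm-up
    start = time.time()
    for _ in range(100):
        sess.run(depthwise_only)
    print("depthwise only: %.1f ms" % ((time.time() - start) / 100 * 1000))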
I found this question on Stack Overflow, which exactly captures the essence of my remark that separable convolution in its current implementation seems pretty much useless, because the depthwise convolution is much slower than the normal tf.nn.conv2d: https://stackoverflow.com/questions/39368367/tf-nn-depthwise-conv2d-is-too-slow-is-it-normal |
I am facing the same problem: separable convolution sometimes runs slower than normal conv2d on my GPU, but faster than conv2d on CPU. Did you manage to find a solution? |
@BKZero Could you post an example using the SeparableConv2D layer in Keras? I want to compare a CNN model built with Conv2D against one built with SeparableConv2D, but I cannot find examples in Keras. |
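Not an answer from the thread, but a minimal sketch of the kind of comparison being asked for might look like this (assuming the tf.keras API; the layer widths, input shape, batch size, and repeat count are arbitrary choices for illustration):

import time
import numpy as np
import tensorflow as tf

def build(conv_layer):
    """Tiny CNN whose conv blocks use either Conv2D or SeparableConv2D."""
    return tf.keras.Sequential([
        conv_layer(64, 3, padding="same", activation="relu", input_shape=(128, 128, 3)),
        conv_layer(128, 3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])

normal_model = build(tf.keras.layers.Conv2D)
separable_model = build(tf.keras.layers.SeparableConv2D)

x = np.random.rand(32, 128, 128, 3).astype("float32")
for name, model in [("Conv2D", normal_model), ("SeparableConv2D", separable_model)]:
    model.predict(x)  # warm-up / build
    start = time.time()
    for _ in range(20):
        model.predict(x)
    print("%-16s %.1f ms/batch, %d parameters"
          % (name, (time.time() - start) / 20 * 1000, model.count_params()))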
Any updates on this? I have experienced separable convolutions running slower than regular convolution at inference time as well. |
In my eyes, the slow performance of tf.nn.depthwise_conv2d() compared to tf.nn.conv2d() is definitely an issue (see my comment above). |
Any updates on this? I have a similar experience |
@stengoes I think your implementation may be wrong,
2. To implement the separable convolution correctly, you can simply use the separable conv in the TensorFlow library. In my test, separable conv is faster than traditional conv on CPU (Mac, 2 GHz Intel Core i7). As for the depthwise part, it is faster than the traditional conv, but the speed-up is not proportional to the MAC-operation ratio of the two convolutions (depthwise is slower than expected). |
@AustinVan I still believe that my implementation is right. The number of output channels of the depthwise conv does NOT necessarily have to be equal to the number of input channels. See the documentation of the separable conv here: it says that the pointwise filter has dimensions [1, 1, channel_multiplier * in_channels, out_channels]. This means that the number of input channels of the pointwise conv (which is the same as the number of output channels of the depthwise conv) is channel_multiplier * in_channels. So unless channel_multiplier equals 1, your claim is wrong. The channel multiplier comes from the depthwise filters, which have dimensions [filter_height, filter_width, in_channels, channel_multiplier]. Moreover, the implementation also checks for equality between my implementation and the separable conv:

# Assert equality of the different methods
norm, sep = sess.run([normal, separable])
np.testing.assert_almost_equal(norm, sep, decimal=3)

However, I did just notice that with the newer versions of TensorFlow the implementation of the separable conv layer (see here) has changed, so the new implementation might be faster now. I will check it later this week. |
@stengoes Thanks. But regarding the paper https://arxiv.org/abs/1704.04861, I am not sure whether accuracy will increase or decrease when the internal channel count becomes large in MobileNet. In my opinion it is an unfair comparison if we do not follow the paper. When I changed the output channels of the depthwise conv in your code, the time of depthwise+pointwise was equal to that of the separable conv in the TensorFlow library. |
I found another interesting repo, https://github.com/peisuke/DeepLearningSpeedComparison, which lists the speed of MobileNet on several mainstream deep learning frameworks. If you still think the implementation in TensorFlow is too slow, |
Sorry I am late. I found an interesting phenomenon: if I use MobileNet on PC it is really slow, but on the Android platform it is fast. I do not know why, but it is real. |
Any updates? I can confirm this problem on PC. Other DL libraries seem to face similar problems. I found an interesting repo with a Caffe implementation of the separable conv that sped up the computation noticeably: https://github.com/yonghenglh6/DepthwiseConvolution |
import tensorflow as tf
import numpy as np
import time
# Define a scenario
batch_size = 64
channels = 32
image_size = 32
feature_maps = 64
filter_size = 15
depthwise_filters = 8
# Dummy images
images = tf.random_normal(shape=[batch_size, channels, image_size, image_size],
dtype=tf.float32)
# Filter definitions
basis_filters = tf.random_normal(shape=[filter_size, filter_size, channels, depthwise_filters],
dtype=tf.float32)
coeffs = tf.random_normal(shape=[channels, depthwise_filters, feature_maps],
dtype=tf.float32)
# Normal method
effective_filters = tf.einsum('hwcm,cmn->hwcn', basis_filters, coeffs)
normal = tf.nn.conv2d(images,
effective_filters,
strides=[1, 1, 1, 1],
padding="SAME",
use_cudnn_on_gpu=True,
data_format="NCHW")
# Separable method
depthwise = tf.nn.depthwise_conv2d_native(images,
basis_filters,
strides=[1, 1, 1, 1],
padding="SAME",
data_format="NCHW")
coeffs = tf.reshape(coeffs, [1, 1, channels*depthwise_filters, feature_maps])
separable = tf.nn.conv2d(depthwise,
coeffs,
strides=[1, 1, 1, 1],
padding="VALID",
use_cudnn_on_gpu=True,
data_format="NCHW")
with tf.Session() as sess:
    # Assert equality of the different methods
    norm, sep = sess.run([normal, separable])
    np.testing.assert_almost_equal(norm, sep, decimal=3)
    repeats = 100
    # Benchmark normal method
    start = time.time()
    for _ in xrange(repeats):
        _ = sess.run(normal)
    end = time.time()
    d1 = int((end - start) / repeats * 1000)
    # Benchmark separable method
    start = time.time()
    for _ in xrange(repeats):
        _ = sess.run(separable)
    end = time.time()
    d2 = int((end - start) / repeats * 1000)
    # Print results
    print("Normal method: {}ms \t Separable method: {}ms".format(d1, d2))

I evaluated the code snippet once more, this time on an Nvidia Pascal Titan X:
In tensorflow-v1.6.0 (on the Pascal Titan X) the separable method is ~4-5x slower than the normal method. So one could argue that the relative performance of the separable conv has improved with the newer versions of TensorFlow. I guess the normal method is still faster because it only has to launch one CUDA kernel, whereas the separable convolution needs two kernels: a depthwise and a pointwise kernel. Maybe this could be solved by fusing the depthwise and pointwise CUDA kernels, but I am no CUDA expert. TLDR: |
For me the tf.nn.depthwise_conv2d_native function is really annoying. The code is as follows:
|
I find that training MobileNet v2 is much slower than MobileNet v1 on TensorFlow, at about 50% of the FPS. The main difference between the two versions is that v2 has much larger depthwise_conv layers. On PyTorch, however, the training time is reduced. These problems are also reported in https://www.zhihu.com/question/265709710. So I think the overhead occurs in TensorFlow rather than in CUDA. |
I'm using Arch Linux. So for the experiment mentioned above it's TensorFlow v1.6.0 with CUDA 9.1.85.2 and cuDNN 7.1.2 on the GTX 1080. I didn't test this on TensorFlow v1.7.0 though.

2018-04-19 15:41 GMT+08:00 Sebyakin Andrei <notifications@github.com>:
… What about performance with cuDNN 7.1? One of its new features is to handle grouped conv.
|
Any updates on this? I have a similar experience. Do you guys have any alternatives? |
Same here! I have the same experience. Please post any news. Thank you in advance. |
I was also previously having issues with the speed of separable convolutions, but it was because of how I was using them. The effectiveness of separable convolutions compared to normal convolutions depends on two variables: the number of channels and the depth multiplier. This script demonstrates where separable convolutions are faster than normal convolutions and where they are slower (a rough sketch of this kind of sweep is given after this comment).
Results:
When the number of channels is small (32), separable convs are actually slower than normal convs. The value of separable convs becomes apparent as the number of channels increases; the inflection point is ~64 channels. The depth multiplier does not affect the runtime performance of normal convolutions, which makes sense: the matrix multiplication of the …

For my own projects, I have been using normal convolutions for the first 2-3 layers and switching to separable convs when the number of channels is at least 64. I also exclusively use a depth multiplier of 1. Using these two ideas I have personally seen models using separable convs run 50-100% faster than models using only normal convs. |
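A minimal sketch of that kind of channels/depth-multiplier sweep (this is not the commenter's original script; it assumes TF 2.x eager execution and tf.keras, and the input shapes and repeat counts are arbitrary):

import time
import numpy as np
import tensorflow as tf

def time_layer(layer, x, repeats=20):
    """Average wall-clock time (ms) for one forward pass through `layer`."""
    y = layer(x)        # builds weights / warm-up
    _ = y.numpy()       # force execution before timing
    start = time.time()
    for _ in range(repeats):
        _ = layer(x).numpy()
    return (time.time() - start) / repeats * 1000.0

for channels in (32, 64, 128, 256):
    x = tf.constant(np.random.rand(8, 64, 64, channels).astype("float32"))  # NHWC
    for depth_multiplier in (1, 2, 4):
        conv = tf.keras.layers.Conv2D(channels, 3, padding="same")
        sep = tf.keras.layers.SeparableConv2D(channels, 3, padding="same",
                                              depth_multiplier=depth_multiplier)
        print("channels=%4d dm=%d  conv2d=%6.1f ms  separable=%6.1f ms"
              % (channels, depth_multiplier, time_layer(conv, x), time_layer(sep, x)))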
@stengoes In your code you have used the associativity of convolution by combining the filters of the separable convolution and the 1x1 convolution, assuming that neither operation uses a non-linearity. You then used this effective filter to convolve with the original input image. This entire operation takes less time because the combining step is applied to much smaller tensors (the filters) compared to the other approach (depthwise separable conv using the native functions). However, the linearity assumption does not hold, because we do apply ReLU after the separable convolution and also after the 1x1 convolution in the original MobileNet v1. So you cannot use the normal method. I have edited your code and added ReLU after each operation. It should give you an assertion error at the line np.testing.assert_almost_equal(norm, sep, decimal=3) when you try to run it.

import tensorflow as tf
import numpy as np
import time
# Define a scenario
batch_size = 64
channels = 32
image_size = 32
feature_maps = 64
filter_size = 15
depthwise_filters = 8
# Dummy images
images = tf.random_normal(shape=[batch_size, channels, image_size, image_size],
dtype=tf.float32)
# Filter definitions
basis_filters = tf.random_normal(shape=[filter_size, filter_size, channels, depthwise_filters],
dtype=tf.float32)
coeffs = tf.random_normal(shape=[channels, depthwise_filters, feature_maps],
dtype=tf.float32)
# Normal method
effective_filters = tf.einsum('hwcm,cmn->hwcn', basis_filters, coeffs)
#nm = tf.Print(effective_filters, [effective_filters], message="This is a: ")
normal = tf.nn.conv2d(images,
effective_filters,
strides=[1, 1, 1, 1],
padding="SAME",
use_cudnn_on_gpu=True,
data_format="NCHW"
)
normal = tf.nn.relu(normal)
# Separable method
depthwise = tf.nn.depthwise_conv2d_native(images,
basis_filters,
strides=[1, 1, 1, 1],
padding="SAME",
data_format="NCHW",
)
depthwise = tf.nn.relu(depthwise)
coeffs = tf.reshape(coeffs, [1, 1, channels*depthwise_filters, feature_maps])
separable = tf.nn.conv2d(depthwise,
coeffs,
strides=[1, 1, 1, 1],
padding="VALID",
use_cudnn_on_gpu=True,
data_format="NCHW",
)
separable = tf.nn.relu(separable)
with tf.Session() as sess:
    # Assert equality of the different methods
    norm, sep = sess.run([normal, separable])
    np.testing.assert_almost_equal(norm, sep, decimal=3)
    repeats = 100
    # Benchmark normal method
    start = time.time()
    for _ in xrange(repeats):
        _ = sess.run(normal)
    end = time.time()
    d1 = int((end - start) / repeats * 1000)
    # Benchmark separable method
    start = time.time()
    for _ in xrange(repeats):
        _ = sess.run(separable)
    end = time.time()
    d2 = int((end - start) / repeats * 1000)
    # Print results
    print("Normal method: {}ms \t Separable method: {}ms".format(d1, d2))
|
Theoretically faster convolutions from https://arxiv.org/pdf/1704.04861.pdf. Implemented with TF-Slim's separable_conv. Slower than standard convolution. tensorflow/tensorflow#12132
I have run the code above, and it seems separable_conv is faster than the normal one in this form, but when I use tf.nn.separable_conv2d or tf.keras.layers.SeparableConv2D the separable one is still slower than the normal method. What's wrong with it? Any updates on it? |
Also experiencing that SeparableConv2D is slower than Conv2D in Keras. The number of input channels does not seem to matter; I tested 32-2048 and in all cases Conv2D is faster. Interestingly, in the SeparableConv2D model the number of parameters is lower, as are the FLOPS. Still, this does not seem to have the desired effect on inference. |
I am using TF version 1.13. |
I also face a strange issue. When I use DepthwiseConv2D in the decoder stage (4x4 --> resize to 8x8 --> DepthwiseConv2D --> resize to 16x16 --> DepthwiseConv2D ...) of an FCN, it runs much slower than the encoder stage (from big feature map to small feature map), which uses the same DepthwiseConv2D (decoder: more than 1 s vs. encoder: less than 100 ms). I ran the test code on a mobile device. |
Any updates on this? |
Same problem here, especially the one @chenbiaolong described. Have you figured out how to fix it? |
Why has this bug not been fixed in so many years? separable_conv2d, as used in MobileNet, is very popular. |
Maybe not the proper place, but here are some of my experiments on MXNet with an NVIDIA Titan X + CUDA 10.1 (not sure about the cuDNN version):
|
There is a paper discussing the trap of FLOPs; maybe for depthwise convolutions memory access dominates the real execution time in GPU/CPU implementations. |
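To make that point concrete, here is a rough arithmetic-intensity estimate (FLOPs per byte of data moved) for a dense conv versus a depthwise conv, using the hypothetical shapes from the benchmark earlier in the thread; the byte counts ignore caching and are only meant to show the order-of-magnitude gap:

# Rough arithmetic-intensity estimate (FLOPs per byte moved), fp32.
# Shapes follow the hypothetical benchmark earlier in this thread.
N, C, H, W = 64, 32, 32, 32   # batch, input channels, spatial size
K, M, C_out = 15, 8, 64       # kernel size, channel multiplier, output channels

def intensity(flops, in_elems, w_elems, out_elems):
    return flops / (4.0 * (in_elems + w_elems + out_elems))  # 4 bytes per float32

dense_flops = 2 * N * H * W * C * C_out * K * K   # regular conv2d
dw_flops    = 2 * N * H * W * C * M * K * K       # depthwise stage only

print("dense conv:     %.0f FLOPs/byte" % intensity(dense_flops, N*C*H*W, K*K*C*C_out, N*C_out*H*W))
print("depthwise conv: %.0f FLOPs/byte" % intensity(dw_flops,    N*C*H*W, K*K*C*M,     N*C*M*H*W))
# The depthwise stage does far fewer multiply-accumulates per byte it reads and
# writes, so on bandwidth-bound hardware the FLOP reduction alone overstates the
# achievable speed-up.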
@keunwoochoi The problem still exists. I agree with @boluoweifenda: the slow inference speed is not caused by a bug in depthwise convolutions; FLOPs are not the only factor that affects inference time. Maybe we should change the network architecture of the decoder stage. |
In my case with tf.layers.separable_conv2d on TensorFlow v1.12: too slow. How about you? |
@yoshizamurai Did you test on an ARM CPU? I didn't test TensorFlow v1.14; my test ran on TensorFlow 1.13. |
So, you should test on an ARM CPU with TensorFlow 1.14. |
|
This seems to be fixed by the latest TF nightly and should be available in 2.2. It requires fp16, NCHW, stride == 1, and cuDNN version >= 7.6.3 though. See #33836 for details. Ping @houtoms to confirm. |
Yes, we follow https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_763.html#rel_763 to enable the fast depthwise cuDNN paths. |
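For anyone trying to hit that path, a minimal sketch of what the conditions quoted above might translate to in Keras (illustrative only; it assumes TF 2.x on a GPU build with cuDNN >= 7.6.3, and the shapes are arbitrary):

import numpy as np
import tensorflow as tf

# Conditions quoted above for the fast cuDNN depthwise path:
# fp16 tensors, NCHW (channels_first) layout, stride 1, cuDNN >= 7.6.3.
layer = tf.keras.layers.DepthwiseConv2D(
    kernel_size=3,
    strides=1,
    padding="same",
    data_format="channels_first",  # NCHW
    dtype="float16",               # fp16 weights and compute
)

x = tf.constant(np.random.rand(8, 64, 56, 56).astype("float16"))  # NCHW, fp16
y = layer(x)             # runs the depthwise conv (on GPU, may hit the fast path)
print(y.shape, y.dtype)  # (8, 64, 56, 56) float16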
Hi @byronyi and @houtoms, it requires fp16, so it is not enough. I think this issue is still open. |
Any updates here? I'm suffering from this too. |
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
Closing as stale. Please reopen if you'd like to work on this further. |
System information
Describe the problem
The depthwise+pointwise structure should theoretically be faster than the traditional convolution layer, but the TensorFlow implementation makes it slower. It doesn't make sense.
Here is part of my network definition:
#net = slim.conv2d(net, 32, [3, 3], scope='conv1-2')
#end_points['conv1-2'] = net
net = slim.separable_conv2d(net,None,[3,3],depth_multiplier=1,stride=1,rate=1,normalizer_fn=slim.batch_norm,scope='conv1-2-depthwise')
end_points['conv1-2-depthwise'] = net
net = slim.conv2d(net, depth(32), [1, 1], stride=1, normalizer_fn=slim.batch_norm, scope='conv1-2-pointwise')
end_points['conv1-2-pointwise'] = net
I just changed the network definition
from:
net = slim.conv2d(net, 32, [3, 3], scope='conv1-2')
end_points['conv1-2'] = net
to:
net = slim.separable_conv2d(net,None,[3,3],depth_multiplier=1,stride=1,
rate=1,normalizer_fn=slim.batch_norm,scope='conv1-2-depthwise')
end_points['conv1-2-depthwise'] = net
net = slim.conv2d(net, depth(32), [1, 1], stride=1, normalizer_fn=slim.batch_norm,
scope='conv1-2-pointwise')
end_points['conv1-2-pointwise'] = net
I do not think I am doing something wrong, so where is the problem?