
trtmodel (max batch size = 2) spends about 2x the inference time of trtmodel (max batch size = 1) on convolution and activation layers #1046


Closed
githublsk opened this issue Feb 5, 2021 · 24 comments
Labels
Module:Performance (General performance issues), triaged (Issue has been triaged by maintainers)

Comments

@githublsk

Environment:

TensorRT Version: 7.2.1
CUDA Version: 11.1
CUDNN Version: 8.0.4
Operating System + Version: ubuntu18.04
Python Version: 3.6.10
PyTorch Version: 1.7.0
Description:
First, using the mmdetection-to-tensorrt project (https://github.com/grimoire/mmdetection-to-tensorrt), we converted our Faster R-CNN model (.pth file) directly to a trtmodel with max batch size = 1; the conversion command is as follows:
[screenshot: conversion command]
Then we ran the same conversion but with max batch size = 2; the command is as follows:
[screenshot: conversion command with max batch size = 2]
Third, we ran inference with both converted trtmodels on the same image (for the model with max batch size = 2 the image was repeated twice), and added the code below to record per-layer times. The table below summarizes the 20 most time-consuming layers for the two models. For the layers marked in green, the time for the model with max batch size = 2 is almost double that of the model with max batch size = 1. That seems unreasonable, because those highlighted layers are convolution layers whose tensor operations should be able to run in parallel across the batch. Can you give me some suggestions? Thank you.
[screenshot: profiler code]
[screenshot: top-20 per-layer timing table]

@githublsk
Author

@ttyio @nvpohanh can you help me answer the question above? It is quite urgent for me, thank you very much!

@nvpohanh
Collaborator

It is expected that BS=2 takes almost 2x the runtime of BS=1, because you now have 2x the math operations while the math throughput remains the same.

because those highlighted layers are convolution layers whose tensor operations should be able to run in parallel across the batch.

Could you elaborate on why they would run in parallel? The GPU has limited throughput for tensor operations, and even at BS=1 it may already be using all of the math throughput.

@githublsk
Author

@nvpohanh Thank you for responding so quickly. The best-practices guide (https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html), in section 2.2 "Batching", says: "The most important optimization is to compute as many results in parallel as possible using batching" and "Each layer of the network will have some amount of overhead and synchronization required to compute forward inference. By computing more results in parallel, this overhead is paid off more efficiently. In addition, many layers are performance-limited by the smallest dimension in the input. If the batch size is one or small, this size can often be the performance limiting dimension. For example, the FullyConnected layer with V inputs and K outputs can be implemented for one batch instance as a matrix multiply of an 1xV matrix with a VxK weight matrix. If N instances are batched, this becomes an NxV multiplied by VxK matrix. The vector-matrix multiply becomes a matrix-matrix multiply, which is much more efficient." Based on this, I thought the convolution layers could use batching to compute in parallel and improve efficiency, but batching does not seem to help the convolution layers in our detection model, and we are confused: where does batching actually improve efficiency? Can you give me a hint? Thank you.
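For reference, a minimal NumPy illustration of the FullyConnected shape argument quoted above (the sizes V, K, and N are made up): at BS=1 the layer is a 1xV times VxK vector-matrix product, and batching N instances turns it into an NxV times VxK matrix-matrix product against the same weights.

```python
import numpy as np

V, K, N = 4096, 1000, 8                            # hypothetical sizes
W = np.random.randn(V, K).astype(np.float32)       # VxK weight matrix

x1 = np.random.randn(1, V).astype(np.float32)      # one batch instance: 1xV
y1 = x1 @ W                                        # vector-matrix multiply -> (1, K)

xN = np.random.randn(N, V).astype(np.float32)      # N batched instances: NxV
yN = xN @ W                                        # matrix-matrix multiply -> (N, K)
```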

@githublsk
Author

@ttyio @nvpohanh can you help me answer the question above? Thanks in advance.

@nvpohanh
Collaborator

Generally, GPU computation is more efficient when the batch size is larger. This is because when you have a lot of ops, you can fully utilize the GPUs and hide some inefficiency or overhead between ops. However, if there are already a lot of ops at BS=1 and even BS=1 is able to fully utilize the GPUs, you may not see any increase in efficiency anymore.

For example, is your input size BSx3x1600x1000? This is a super large image which is expected to fully utilize even the largest GPU we have (like A100), so I don't think increasing BS gives any benefit in GPU efficiency.

In terms of N/V/K, in your case the "N" is already 1600x1000 at BS=1, so N=1600x1000 vs N=2x1600x1000 does not make much difference in terms of GPU efficiency, compared to N=1 vs N=2.
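Rough arithmetic behind this point, treating the convolution as an implicit GEMM whose row count is roughly batch size times the output spatial size (a simplification, using the 1600x1000 size from this thread):

```python
H, W = 1600, 1000                      # spatial size from this thread
for bs in (1, 2):
    n = bs * H * W                     # rough "N" dimension of the implicit GEMM
    print(f"BS={bs}: effective N = {n:,}")
# BS=1: effective N = 1,600,000
# BS=2: effective N = 3,200,000
# Both values are already enormous, so the GPU is saturated at BS=1; doubling N
# roughly doubles the work (and the runtime) rather than improving efficiency.
```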

@githublsk
Author

@nvpohanh Thank you for your response. I have a question: how can I confirm whether the GPU is fully utilized? For example, by using the "nvidia-smi" command, or some other command or tool? Thanks in advance!

@nvpohanh
Collaborator

Yes! You can run nvidia-smi dmon -s u in parallel to check utilization. Or use Nsight Systems to visualize the profiles: https://developer.nvidia.com/nsight-systems

@githublsk
Author

Hi @nvpohanh, in our test environment the GPU monitoring output during inference is shown below:
[screenshot: nvidia-smi dmon output]
It seems the GPU efficiency is limited by SM. Without changing our test requirements, how can we reduce the SM value to improve GPU efficiency? Can you give us some suggestions? It would be a great help!

@nvpohanh
Collaborator

SM utilization is the GPU utilization. Ideally, you want to see all 100s for SM utilization. Could you profile it using Nsight Systems to see why GPU is sometimes idle, resulting in low SM utilization? Maybe there are gaps between batches?

@ttyio ttyio added the ask-the-experts and Module:Performance (General performance issues) labels and removed the Topic: Triton label Feb 22, 2021
@githublsk
Author

githublsk commented Feb 25, 2021

@nvpohanh Thanks for your response!

Could you profile it using Nsight Systems to see why GPU is sometimes idle, resulting in low SM utilization? Maybe there are gaps between batches?

I have downloaded the Nsight Systems CLI and used the command below to profile our application with batch size 1.
command:
nsys profile -o tensorrt --trace=cuda --cudabacktrace=all --stats=true --sample=cpu python model_fastercnn_pytorch.py
qdrep file generated:
[screenshot: generated qdrep report file]

I have below two questions, can you give me some suggestions?

  1. In the qdrep file, how can I find the information about "why the GPU is sometimes idle", as you mentioned? I only analyzed the CUDA API trace, but I could not find any information about the GPU being idle and causing the low SM utilization.

  2. When running inference for the first time with the TensorRT engine, some CUDA operations take a very long time, as you can see in the qdrep file. How can I reduce the time spent in those CUDA operations?

The screenshot I uploaded contains limited information; if needed, I can upload the qdrep file. Thanks in advance!

@githublsk
Author

@nvpohanh can you help me solve the problem above? I have little experience with this, so I hope you can give me some suggestions. Thanks in advance.

@nvpohanh
Collaborator

The part you showed is just the setup stage; the actual inference is at the end. Maybe you can send me the qdrep file so that I can take a quick look.

@githublsk
Author

@nvpohanh Thank you very much.
The attached file is my qdrep file; please help me analyze it, it would be a great help. Thank you.
tensorrt.zip

@nvpohanh
Collaborator

nvpohanh commented Mar 1, 2021

[screenshot: Nsight Systems timeline]

It seems that you only run one batch and time it. Is that correct? To correctly measure the latency of a batch, it is recommended that:

  • You measure the total time of many inferences and find the average latency, rather than timing just one inference (see the sketch below the list).
  • (more advanced) If possible, run the host-to-device copy on a separate CUDA stream and use CUDA events to synchronize between the main inference stream and the H2D copy stream. You can check the source code of trtexec for an example of this.
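A minimal sketch of the first point, assuming the TensorRT Python API with an execution context (context) and device bindings (bindings) that were created elsewhere; those names are hypothetical and the exact setup depends on your script:

```python
import time
import pycuda.autoinit            # creates/activates a CUDA context
import pycuda.driver as cuda

stream = cuda.Stream()
n_warmup, n_runs = 10, 100

# Warm-up runs are excluded from timing (the first run pays one-time CUDA/TRT costs).
for _ in range(n_warmup):
    context.execute_async_v2(bindings, stream.handle)
stream.synchronize()

start = time.perf_counter()
for _ in range(n_runs):
    context.execute_async_v2(bindings, stream.handle)
stream.synchronize()              # wait once for all enqueued inferences
avg_ms = (time.perf_counter() - start) / n_runs * 1000
print(f"average latency over {n_runs} runs: {avg_ms:.2f} ms")
```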

Thanks

@githublsk
Author

githublsk commented Mar 1, 2021

@nvpohanh, thank you for your response.
Yes, I just ran one batch and timed it.
For step one, the attached file contains 10 inference runs on one image; can you help me analyze it?
tensorrt_10.zip

For step two, I could not follow your suggestion; can you point me to the relevant source code so I can quickly confirm it?
Thank you very much.

Sorry to trouble you so much, but your help means a great deal to us.

@githublsk
Author

@nvpohanh can you help me resolve the question above? Thank you in advance.

@nvpohanh
Collaborator

nvpohanh commented Mar 2, 2021

For step two, I could not follow your suggestion; can you point me to the relevant source code so I can quickly confirm it?

Please refer to our trtexec source code as an example.

[screenshot: Nsight Systems timeline]

The profile looks fine to me; I don't think there is any issue in TRT. There seems to be some Python overhead between batches. You may want to optimize that if you want to fully utilize the GPU. Could you let me know if you have more questions? Thanks

@nvpohanh
Collaborator

nvpohanh commented Mar 2, 2021

For example, you can remove the CUDA stream synchronizations between batches and only synchronize once at the end.
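A hedged sketch of that pattern in Python, reusing the hypothetical context, bindings, and stream names from the timing example above; `batches` stands in for your per-batch input loop:

```python
import pycuda.autoinit
import pycuda.driver as cuda

stream = cuda.Stream()

# Slower pattern: synchronizing the stream after every batch stalls the pipeline
# and leaves the GPU idle between batches.
# for batch in batches:
#     context.execute_async_v2(bindings, stream.handle)
#     stream.synchronize()

# Faster pattern: keep enqueueing work on the stream and synchronize only once.
for batch in batches:
    # (copy the batch input into the bound device buffer asynchronously here)
    context.execute_async_v2(bindings, stream.handle)
stream.synchronize()
```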

@nvpohanh
Collaborator

nvpohanh commented Mar 2, 2021

Or replace cudaMemcpy() with cudaMemcpyAsync().
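In the Python setup used in this thread, a rough PyCUDA equivalent (buffer names are hypothetical) is to replace the blocking host-to-device copy with an asynchronous one on the inference stream; asynchronous copies need page-locked host memory to actually overlap:

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

stream = cuda.Stream()
h_input = cuda.pagelocked_empty((1, 3, 1600, 1000), dtype=np.float32)  # pinned host buffer
d_input = cuda.mem_alloc(h_input.nbytes)                               # device buffer

# Blocking copy (like cudaMemcpy): the CPU waits for the transfer to finish.
# cuda.memcpy_htod(d_input, h_input)

# Asynchronous copy (like cudaMemcpyAsync): queued on the stream, so it can
# overlap with host-side work and with kernels on other streams.
cuda.memcpy_htod_async(d_input, h_input, stream)
stream.synchronize()  # only needed here to make the standalone snippet complete
```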

@githublsk
Author

Hi @nvpohanh
The attached file includes our test code and a new qdrep file; can you help me analyze it? There may be some Python overhead between batches; could you help us check it? We have little experience with this, so sorry for all the trouble, and many thanks to you.

resource.zip

@githublsk
Author

@nvpohanh If you have time, can you help me resolve the question above?

@nvpohanh
Collaborator

nvpohanh commented Mar 4, 2021

Hi @githublsk ,
Unfortunately, I am not sure how much more I can help, as the issue does not lie in TRT but in the Python code outside of TRT. I would recommend finding some resources on how to deploy TensorRT models in production to achieve maximum GPU utilization. One possibility is to use Triton Inference Server to load the TRT engines and let Triton schedule the inferences. Thanks

Here are some examples:

@ttyio Is there anything else you think we can help with, since this is not a TRT-specific issue?

@ttyio
Collaborator

ttyio commented Mar 4, 2021

+1 for @nvpohanh 's suggestion to use Triton Inference Server, thanks!

@ttyio ttyio added the good-reference and triaged (Issue has been triaged by maintainers) labels Apr 26, 2021
@ttyio
Collaborator

ttyio commented May 21, 2021

Closing since there has been no activity for more than 3 weeks; please reopen if you still have questions, thanks!

@ttyio ttyio closed this as completed May 21, 2021