
trtmodel (max batch size = 2) spends about 2x the inference time of trtmodel (max batch size = 1) on convolution and activation layers #1046


Closed
githublsk opened this issue Feb 5, 2021 · 24 comments
Labels
Module:Performance (General performance issues), triaged (Issue has been triaged by maintainers)

Comments

@githublsk

Environment:

TensorRT Version: 7.2.1
CUDA Version: 11.1
CUDNN Version: 8.0.4
Operating System + Version: ubuntu18.04
Python Version: 3.6.10
PyTorch Version: 1.7.0
Description:
First, using the mmdetection-to-tensorrt project (https://github.com/grimoire/mmdetection-to-tensorrt), we converted our Faster R-CNN model (.pth file) directly to a trtmodel with max batch size = 1; the conversion command is as follows:
[screenshot: conversion command]
Then we ran the same conversion but with max batch size = 2; the command is as follows:
[screenshot: conversion command with max batch size = 2]
Third, we ran inference with both converted trtmodels on the same image (for the model with max batch size = 2 the image was repeated twice), and added the code below to record per-layer times. The table below summarizes the 20 most time-consuming layers for the two models. For the layers marked in green, the time for the model with max batch size = 2 is almost double that of the model with max batch size = 1. That seems unreasonable, because those highlighted layers are convolution layers whose tensor operations should be able to run in parallel across the batch. Can you give me some suggestions? Thank you.
[screenshot: profiler code]
[screenshot: top-20 per-layer timing table]

@githublsk
Author

@ttyio @nvpohanh can you help me answer the question above? It is quite urgent for me, thank you very much!

@nvpohanh
Collaborator

It is expected that BS=2 takes almost 2x the runtime of BS=1, because you now have 2x the math operations while the math throughput remains the same.

because those highlighted layers are convolution layers whose tensor operations should be able to run in parallel across the batch.

Could you elaborate on why they would run in parallel? The GPU has limited throughput for tensor operations, and even at BS=1 it may already be using all of the math throughput.

@githublsk
Author

@nvpohanh Thank you for responding so quickly. The best-practices guide (https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html), in section 2.2 "Batching", says: "The most important optimization is to compute as many results in parallel as possible using batching" and "Each layer of the network will have some amount of overhead and synchronization required to compute forward inference. By computing more results in parallel, this overhead is paid off more efficiently. In addition, many layers are performance-limited by the smallest dimension in the input. If the batch size is one or small, this size can often be the performance limiting dimension. For example, the FullyConnected layer with V inputs and K outputs can be implemented for one batch instance as a matrix multiply of an 1xV matrix with a VxK weight matrix. If N instances are batched, this becomes an NxV multiplied by VxK matrix. The vector-matrix multiply becomes a matrix-matrix multiply, which is much more efficient." Based on this, I thought the convolution layers could use batching to compute in parallel and improve efficiency, but batching does not seem to help the convolution layers in our detection model, and we are confused: where does batching actually improve efficiency? Can you give me a hint? Thank you.
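For reference, a minimal NumPy illustration of the FullyConnected shape argument quoted above (the sizes V, K, and N are made up): at BS=1 the layer is a 1xV times VxK vector-matrix product, and batching N instances turns it into an NxV times VxK matrix-matrix product against the same weights.

```python
import numpy as np

V, K, N = 4096, 1000, 8                            # hypothetical sizes
W = np.random.randn(V, K).astype(np.float32)       # VxK weight matrix

x1 = np.random.randn(1, V).astype(np.float32)      # one batch instance: 1xV
y1 = x1 @ W                                        # vector-matrix multiply -> (1, K)

xN = np.random.randn(N, V).astype(np.float32)      # N batched instances: NxV
yN = xN @ W                                        # matrix-matrix multiply -> (N, K)
```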

@githublsk
Author

@ttyio @nvpohanh can you help me answer the question above? Thanks in advance.

@nvpohanh
Collaborator

Generally, GPU computation is more efficient when the batch size is larger. This is because when you have a lot of ops, you can fully utilize the GPUs and hide some inefficiency or overhead between ops. However, if there are already a lot of ops at BS=1 and even BS=1 is able to fully utilize the GPUs, you may not see any increase in efficiency anymore.

For example, is your input size BSx3x1600x1000? This is a super large image which is expected to fully utilize even the largest GPU we have (like A100), so I don't think increasing BS gives any benefit in GPU efficiency.

In terms of N/V/K, in your case the "N" is already 1600x1000 at BS=1, so N=1600x1000 vs N=2x1600x1000 does not make much difference in terms of GPU efficiency, compared to N=1 vs N=2.
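Rough arithmetic behind this point, treating the convolution as an implicit GEMM whose row count is roughly batch size times the output spatial size (a simplification, using the 1600x1000 size from this thread):

```python
H, W = 1600, 1000                      # spatial size from this thread
for bs in (1, 2):
    n = bs * H * W                     # rough "N" dimension of the implicit GEMM
    print(f"BS={bs}: effective N = {n:,}")
# BS=1: effective N = 1,600,000
# BS=2: effective N = 3,200,000
# Both values are already enormous, so the GPU is saturated at BS=1; doubling N
# roughly doubles the work (and the runtime) rather than improving efficiency.
```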

@githublsk
Author

@nvpohanh Thank you for your response. I have a question: how can I confirm whether the GPU is fully utilized? For example, by using the "nvidia-smi" command, or some other command or tool? Thanks in advance!

@nvpohanh
Collaborator

Yes! You can run nvidia-smi dmon -s u in parallel to check utilization. Or use Nsight Systems to visualize the profiles: https://developer.nvidia.com/nsight-systems

@githublsk
Author

Hi @nvpohanh, in our test environment the GPU monitoring output during inference is shown below:
[screenshot: nvidia-smi dmon output]
It seems the GPU efficiency is limited by SM. Without changing our test requirements, how can we reduce the SM value to improve GPU efficiency? Can you give us some suggestions? It would be a great help!

@nvpohanh
Collaborator

SM utilization is the GPU utilization. Ideally, you want to see all 100s for SM utilization. Could you profile it using Nsight Systems to see why GPU is sometimes idle, resulting in low SM utilization? Maybe there are gaps between batches?

@ttyio ttyio added the ask-the-experts and Module:Performance (General performance issues) labels and removed the Topic: Triton label Feb 22, 2021
@githublsk
Author

githublsk commented Feb 25, 2021

@nvpohanh Thanks for your response!

Could you profile it using Nsight Systems to see why GPU is sometimes idle, resulting in low SM utilization? Maybe there are gaps between batches?

I have downloaded the Nsight Systems CLI and used the command below to profile our application with batch size 1.
command:
nsys profile -o tensorrt --trace=cuda --cudabacktrace=all --stats=true --sample=cpu python model_fastercnn_pytorch.py
qdrep file generated:
[screenshot: generated qdrep report file]

I have below two questions, can you give me some suggestions?

  1. In the qdrep file, how can I find the information about "why the GPU is sometimes idle", as you mentioned? I only analyzed the CUDA API trace, but I could not find any information about the GPU being idle and causing the low SM utilization.

  2. When running inference for the first time with the TensorRT engine, some CUDA operations take a very long time, as you can see in the qdrep file. How can I reduce the time spent in those CUDA operations?

The screenshot I uploaded contains limited information; if needed, I can upload the qdrep file. Thanks in advance!

@githublsk
Author

@nvpohanh can you help me solve the problem above? I have little experience with this, so I hope you can give me some suggestions. Thanks in advance.

@nvpohanh
Collaborator

The part you showed is just the setup stage; the actual inference is at the end. Maybe you can send me the qdrep file so that I can take a quick look.

@githublsk
Author

@nvpohanh Thank you very much.
The attached file is my qdrep file; please help me analyze it, it would be a great help. Thank you.
tensorrt.zip

@nvpohanh
Collaborator

nvpohanh commented Mar 1, 2021

[screenshot: Nsight Systems timeline]

It seems that you only run one batch and time it. Is that correct? To correctly measure the latency of a batch, it is recommended that:

  • You measure the total time of many inferences and find the average latency, rather than timing just one inference (see the sketch below the list).
  • (more advanced) If possible, run the host-to-device copy on a separate CUDA stream and use CUDA events to synchronize between the main inference stream and the H2D copy stream. You can check the source code of trtexec for an example of this.
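A minimal sketch of the first point, assuming the TensorRT Python API with an execution context (context) and device bindings (bindings) that were created elsewhere; those names are hypothetical and the exact setup depends on your script:

```python
import time
import pycuda.autoinit            # creates/activates a CUDA context
import pycuda.driver as cuda

stream = cuda.Stream()
n_warmup, n_runs = 10, 100

# Warm-up runs are excluded from timing (the first run pays one-time CUDA/TRT costs).
for _ in range(n_warmup):
    context.execute_async_v2(bindings, stream.handle)
stream.synchronize()

start = time.perf_counter()
for _ in range(n_runs):
    context.execute_async_v2(bindings, stream.handle)
stream.synchronize()              # wait once for all enqueued inferences
avg_ms = (time.perf_counter() - start) / n_runs * 1000
print(f"average latency over {n_runs} runs: {avg_ms:.2f} ms")
```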

Thanks

@githublsk
Author

githublsk commented Mar 1, 2021

@nvpohanh, thank you for your response.
Yes, I just ran one batch and timed it.
For step one, the attached file contains 10 inference runs on one image; can you help me analyze it?
tensorrt_10.zip

For step two, I could not follow your suggestion; can you point me to the relevant source code so I can quickly confirm it?
Thank you very much.

Sorry to trouble you so much, but your help means a great deal to us.

@githublsk
Author

@nvpohanh can you help me resolve the question above? Thank you in advance.

@nvpohanh
Collaborator

nvpohanh commented Mar 2, 2021

For step two, I could not follow your suggestion; can you point me to the relevant source code so I can quickly confirm it?

Please refer to our trtexec source code as an example.

[screenshot: Nsight Systems timeline]

The profile looks fine to me; I don't think there is any issue in TRT. There seems to be some Python overhead between batches. You may want to optimize that if you want to fully utilize the GPU. Could you let me know if you have more questions? Thanks

@nvpohanh
Collaborator

nvpohanh commented Mar 2, 2021

For example, you can remove the CUDA stream synchronizations between batches and only synchronize once at the end.
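A hedged sketch of that pattern in Python, reusing the hypothetical context, bindings, and stream names from the timing example above; `batches` stands in for your per-batch input loop:

```python
import pycuda.autoinit
import pycuda.driver as cuda

stream = cuda.Stream()

# Slower pattern: synchronizing the stream after every batch stalls the pipeline
# and leaves the GPU idle between batches.
# for batch in batches:
#     context.execute_async_v2(bindings, stream.handle)
#     stream.synchronize()

# Faster pattern: keep enqueueing work on the stream and synchronize only once.
for batch in batches:
    # (copy the batch input into the bound device buffer asynchronously here)
    context.execute_async_v2(bindings, stream.handle)
stream.synchronize()
```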

@nvpohanh
Collaborator

nvpohanh commented Mar 2, 2021

Or replace cudaMemcpy() with cudaMemcpyAsync().
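In the Python setup used in this thread, a rough PyCUDA equivalent (buffer names are hypothetical) is to replace the blocking host-to-device copy with an asynchronous one on the inference stream; asynchronous copies need page-locked host memory to actually overlap:

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

stream = cuda.Stream()
h_input = cuda.pagelocked_empty((1, 3, 1600, 1000), dtype=np.float32)  # pinned host buffer
d_input = cuda.mem_alloc(h_input.nbytes)                               # device buffer

# Blocking copy (like cudaMemcpy): the CPU waits for the transfer to finish.
# cuda.memcpy_htod(d_input, h_input)

# Asynchronous copy (like cudaMemcpyAsync): queued on the stream, so it can
# overlap with host-side work and with kernels on other streams.
cuda.memcpy_htod_async(d_input, h_input, stream)
stream.synchronize()  # only needed here to make the standalone snippet complete
```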

@githublsk
Author

Hi @nvpohanh
The attached file includes our test code and a new qdrep file; can you help me analyze it? There may be some Python overhead between batches; could you help us check it? We have little experience with this, so sorry for all the trouble, and many thanks to you.

resource.zip

@githublsk
Author

@nvpohanh If you have time, can you help me resolve the question above?

@nvpohanh
Collaborator

nvpohanh commented Mar 4, 2021

Hi @githublsk ,
Unfortunately, I am not sure how much more I can help, as the issue does not lie in TRT but in the Python code outside of TRT. I would recommend finding some resources on how to deploy TensorRT models in production to achieve maximum GPU utilization. One possibility is to use Triton Inference Server to load the TRT engines and let Triton schedule the inferences. Thanks

Here are some examples:

@ttyio Is there anything else you think we can help with, since this is not a TRT-specific issue?

@ttyio
Collaborator

ttyio commented Mar 4, 2021

+1 for @nvpohanh 's suggestion to use Triton Inference Server, thanks!

@ttyio ttyio added the good-reference and triaged (Issue has been triaged by maintainers) labels Apr 26, 2021
@ttyio
Collaborator

ttyio commented May 21, 2021

Closing since there has been no activity for more than 3 weeks; please reopen if you still have questions, thanks!

@ttyio ttyio closed this as completed May 21, 2021