trtserver uses more than 20 CPUs #1080

Closed

lxl910915 opened this issue Feb 2, 2020 · 11 comments


lxl910915 commented Feb 2, 2020

Description
Model: Tensorflow EAST model
Convert saved_model to onnx: python -m tf2onnx.convert --saved-model /tmp/SavedModel --output model.onnx --outputs feature_fusion/concat_3,feature_fusion/Conv_7/Sigmoid --opset 10
Result of converting saved_model to onnx: After optimization: Add -3 (19->16), Const -59 (379->320), Gather +3 (0->3), Identity -18 (18->0), Reshape +2 (0->2), Transpose -262 (264->2)

trtserver loads this model.onnx. When a client connects to trtserver over gRPC, trtserver uses more than 20 CPUs and very little GPU.

However, when we add --fold_const to the conversion: python -m tf2onnx.convert --saved-model /tmp/SavedModel --output model.onnx --outputs feature_fusion/concat_3,feature_fusion/Conv_7/Sigmoid --opset 10 --fold_const
Result of converting saved_model to onnx: After optimization: Add -63 (79->16), Const -10 (145->135), Identity -18 (18->0), Reshape +2 (0->2), Transpose -138 (140->2)

Now trtserver uses only 1 CPU and more GPU.

TRTIS Information
What version of TRTIS are you using? 20.01
Are you using the TRTIS container or did you build it yourself? We built it ourselves.


@deadeyegoodwin
Contributor

TRTIS uses ONNX Runtime to execute ONNX models https://github.com/microsoft/onnxruntime. I think this is more of an ONNX Runtime question as to why CPU vs GPU is used to execute the model. Are you setting any instance_group or optimization options in your model configuration?
@GuanLuo do we have a script to run ONNX models directly with ONNX runtime so we can compare behavior?


lxl910915 commented Feb 4, 2020

> TRTIS uses ONNX Runtime to execute ONNX models https://github.com/microsoft/onnxruntime. I think this is more of an ONNX Runtime question as to why CPU vs GPU is used to execute the model. Are you setting any instance_group or optimization options in your model configuration?
> @GuanLuo do we have a script to run ONNX models directly with ONNX runtime so we can compare behavior?

@deadeyegoodwin Our config.pbtxt is:

name: "east"
platform: "onnxruntime_onnx"
max_batch_size : 0
input [
  {
    name: "input_images:0"
    data_type: TYPE_FP32
    dims: [1,256,256,3]
  }
]
output [
  {
    name: "feature_fusion/Conv_7/Sigmoid:0"
    data_type: TYPE_FP32
    dims: [1,64,64,1]
  },
  {
    name: "feature_fusion/concat_3:0"
    data_type: TYPE_FP32
    dims: [1,64,64,5]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
optimization { execution_accelerators {
  gpu_execution_accelerator : [ { name : "tensorrt" } ]
}}

If we remove instance_group and optimization from config.pbtxt, trtserver still uses more than 20 CPUs.

What's more, when we run the ONNX model directly with ONNX Runtime (with TensorRT enabled), it uses only 1 CPU.

ONNX model download:
Link: https://pan.baidu.com/s/1HcxZiFGDg6AJS939FJL9kw (extraction code: shu1)
Run this ONNX model:

import onnxruntime
import numpy as np

# Run the exported EAST model in a tight loop with a dummy input.
ONNX_PATH = "/tmp/east_model.onnx"
image = np.ones((1, 256, 256, 3), dtype=np.float32)
session = onnxruntime.InferenceSession(ONNX_PATH)
ort_in = {session.get_inputs()[0].name: image}
while True:
    session.run(None, ort_in)
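
For reference, here is a minimal sketch of the same loop with the TensorRT EP selected explicitly; the set_providers call and provider names are assumptions about a TensorRT-enabled GPU build of onnxruntime, not part of our original test:

import onnxruntime
import numpy as np

ONNX_PATH = "/tmp/east_model.onnx"
image = np.ones((1, 256, 256, 3), dtype=np.float32)

session = onnxruntime.InferenceSession(ONNX_PATH)
# Assumption: a TensorRT-enabled GPU build of onnxruntime that exposes
# set_providers(); TensorRT is listed first so it is preferred over CUDA.
session.set_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"])

ort_in = {session.get_inputs()[0].name: image}
while True:
    session.run(None, ort_in)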

We profiled trtserver with perf and generated its flame graph.

Flame graph file is:
trtis-cpu.zip

It shows that libgomp.so.1.0.0 consumes a lot of CPU time. Our GCC version is 7.1.0.

Next, we will build TRTIS in debug mode and profile it again.


GuanLuo commented Feb 4, 2020

@deadeyegoodwin I have a Dockerfile that builds ONNX Runtime and a sample executable that loads and runs a model in ONNX Runtime directly, but it is out of date (pre ONNX Runtime v1.0.0)... I can revisit it and post it here later this week.


lxl910915 commented Feb 5, 2020

@GuanLuo
We tested the following code:

  // Create the ORT environment and session options.
  const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
  OrtEnv* env;
  g_ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "test", &env);

  OrtSessionOptions* session_options;
  g_ort->CreateSessionOptions(&session_options);
  g_ort->SetIntraOpNumThreads(session_options, 1);
  g_ort->SetSessionGraphOptimizationLevel(session_options, ORT_ENABLE_BASIC);

  // Register execution providers: CUDA first, then TensorRT.
  OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);
  OrtSessionOptionsAppendExecutionProvider_Tensorrt(session_options);

  // Load the model and run it in a loop. Input/output name and tensor setup
  // is omitted here; RETURN_IF_ORT_ERROR is our error-checking macro.
  OrtSession* session;
  const char* model_path = "/tmp/east_model.onnx";
  printf("Using Onnxruntime C API\n");
  RETURN_IF_ORT_ERROR(g_ort->CreateSession(env, model_path, session_options, &session));
  printf("Created Session\n");
  while (true) {
    printf("Run Session\n");
    RETURN_IF_ORT_ERROR(g_ort->Run(
      session, NULL, input_names.data(),
      (const OrtValue* const*)input_tensors_.data(), input_tensors_.size(),
      output_names.data(), output_names.size(), output_tensors_.data()));
  }

It works well.


GuanLuo commented Feb 5, 2020

@lxl910915 What is the ONNX Runtime version you are using? The above code is similar to how we invoke the ORT APIs, except that for now we always disable optimization in SetSessionGraphOptimizationLevel.

By the way, the order in which you call OrtSessionOptionsAppendExecutionProvider_XXX affects the priority with which ops are assigned to the execution providers. Your example prioritizes CUDA over TensorRT, but TRTIS will prioritize TensorRT over CUDA if the TensorRT EP is specified.
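
To illustrate that ordering, here is a minimal sketch (not TRTIS's actual code) that reuses the two registration calls from your snippet, with TensorRT registered first so it gets first pick of ops:

  // Sketch only: register TensorRT before CUDA so that TensorRT is preferred
  // when ops are assigned; CUDA handles whatever TensorRT does not claim.
  // Swapping the two calls gives the CUDA-first ordering from your snippet.
  OrtSessionOptionsAppendExecutionProvider_Tensorrt(session_options);
  OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);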


lxl910915 commented Feb 6, 2020

@GuanLuo Thanks for your reply.

  1. Our ONNX Runtime version is 1.1.0.

  2. We also tried both the CUDA->TensorRT and TensorRT->CUDA orderings, and both work well.

  3. If we prioritize CUDA over TensorRT in TRTIS, TRTIS core dumps.

  4. Finally, if we disable optimization in SetSessionGraphOptimizationLevel, it consumes a lot of CPUs. So we set ORT_ENABLE_BASIC in SetSessionGraphOptimizationLevel, and TRTIS only uses 1 CPU, but the inference speed decreases by about 10%.

What's more, if we set ORT_ENABLE_BASIC in SetSessionGraphOptimizationLevel and then enable dynamic batching, the problem is still there. If we add --fold_const to tf2onnx.convert, trtserver uses only 1 CPU.


GuanLuo commented Feb 7, 2020

I assume you achieved 3. by changing the source code? Otherwise, I think TRTIS always prioritizes other GPU accelerators over CUDA. If so, can you share the code change? It is strange that changing the order causes an exception on the TRTIS side.

Are you building TRTIS from the 20.01 branch or from master? Master has now rolled forward to ONNX Runtime 1.1.0. If you already built from master, then we should investigate further...


lxl910915 commented Feb 7, 2020

@GuanLuo In the onnx_backend.cc file, we changed

  if (gpu_device != Context::NO_GPU_DEVICE) {
#ifdef TRTIS_ENABLE_GPU
    if (Config().optimization().has_execution_accelerators()) {
      ...
    }
    RETURN_IF_ORT_ERROR(OrtSessionOptionsAppendExecutionProvider_CUDA(
        session_options, gpu_device));
    LOG_VERBOSE(1) << "CUDA Execution Accelerator is set for " << instance_name
                   << " on device " << gpu_device;
#else
    return Status(RequestStatusCode::INTERNAL, "GPU instances not supported");
#endif  // TRTIS_ENABLE_GPU
  }

to

  if (gpu_device != Context::NO_GPU_DEVICE) {
#ifdef TRTIS_ENABLE_GPU
    RETURN_IF_ORT_ERROR(OrtSessionOptionsAppendExecutionProvider_CUDA(
        session_options, gpu_device));
    LOG_VERBOSE(1) << "CUDA Execution Accelerator is set for " << instance_name
                   << " on device " << gpu_device;

    if (Config().optimization().has_execution_accelerators()) {
      ...
    }
#else
    return Status(RequestStatusCode::INTERNAL, "GPU instances not supported");
#endif  // TRTIS_ENABLE_GPU
  }

We built the master branch, and the problem still exists.

@lxl910915
Author

This is the same problem as onnx/tensorflow-onnx#784 (comment).


GuanLuo commented Feb 20, 2020

So the root cause of the issue is in ONNX Runtime?

@lxl910915
Author

> So the root cause of the issue is in ONNX Runtime?

It seems so.
