trtserver uses more than 20 CPUs #1080

Closed

lxl910915 opened this issue Feb 2, 2020 · 11 comments


lxl910915 commented Feb 2, 2020

Description
Model: Tensorflow EAST model
Convert saved_model to onnx: python -m tf2onnx.convert --saved-model /tmp/SavedModel --output model.onnx --outputs feature_fusion/concat_3,feature_fusion/Conv_7/Sigmoid --opset 10
Result of converting saved_model to onnx: After optimization: Add -3 (19->16), Const -59 (379->320), Gather +3 (0->3), Identity -18 (18->0), Reshape +2 (0->2), Transpose -262 (264->2)

trtserver loads this model.onnx. When a client connects to trtserver over gRPC, trtserver uses more than 20 CPUs and very little GPU.

However, when we add --fold_const to the conversion: python -m tf2onnx.convert --saved-model /tmp/SavedModel --output model.onnx --outputs feature_fusion/concat_3,feature_fusion/Conv_7/Sigmoid --opset 10 --fold_const
Result of converting saved_model to onnx: After optimization: Add -63 (79->16), Const -10 (145->135), Identity -18 (18->0), Reshape +2 (0->2), Transpose -138 (140->2)

Now trtserver uses only 1 CPU and more GPU.

TRTIS Information
What version of TRTIS are you using? 20.01
Are you using the TRTIS container or did you build it yourself? We built it ourselves.


@deadeyegoodwin
Contributor

TRTIS uses ONNX Runtime to execute ONNX models https://github.com/microsoft/onnxruntime. I think this is more of an ONNX Runtime question as to why CPU vs GPU is used to execute the model. Are you setting any instance_group or optimization options in your model configuration?
@GuanLuo do we have a script to run ONNX models directly with ONNX runtime so we can compare behavior?


lxl910915 commented Feb 4, 2020

> TRTIS uses ONNX Runtime to execute ONNX models https://github.com/microsoft/onnxruntime. I think this is more of an ONNX Runtime question as to why CPU vs GPU is used to execute the model. Are you setting any instance_group or optimization options in your model configuration?
> @GuanLuo do we have a script to run ONNX models directly with ONNX runtime so we can compare behavior?

@deadeyegoodwin Our config.pbtxt is:

name: "east"
platform: "onnxruntime_onnx"
max_batch_size : 0
input [
  {
    name: "input_images:0"
    data_type: TYPE_FP32
    dims: [1,256,256,3]
  }
]
output [
  {
    name: "feature_fusion/Conv_7/Sigmoid:0"
    data_type: TYPE_FP32
    dims: [1,64,64,1]
  },
  {
    name: "feature_fusion/concat_3:0"
    data_type: TYPE_FP32
    dims: [1,64,64,5]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
optimization { execution_accelerators {
  gpu_execution_accelerator : [ { name : "tensorrt" } ]
}}

If we remove instance_group and optimization from config.pbtxt, trtserver still uses more than 20 CPUs.

What's more, when we run the ONNX model directly with ONNX Runtime (with TensorRT enabled), it uses only 1 CPU.

ONNX model download:
Link: https://pan.baidu.com/s/1HcxZiFGDg6AJS939FJL9kw (extraction code: shu1)
Run this ONNX model:

import onnxruntime
import numpy as np

# Run the exported EAST model in a tight loop with a dummy input.
ONNX_PATH = "/tmp/east_model.onnx"
image = np.ones((1, 256, 256, 3), dtype=np.float32)
session = onnxruntime.InferenceSession(ONNX_PATH)
ort_in = {session.get_inputs()[0].name: image}
while True:
    session.run(None, ort_in)
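
For reference, here is a minimal sketch of the same loop with the TensorRT EP selected explicitly; the set_providers call and provider names are assumptions about a TensorRT-enabled GPU build of onnxruntime, not part of our original test:

import onnxruntime
import numpy as np

ONNX_PATH = "/tmp/east_model.onnx"
image = np.ones((1, 256, 256, 3), dtype=np.float32)

session = onnxruntime.InferenceSession(ONNX_PATH)
# Assumption: a TensorRT-enabled GPU build of onnxruntime that exposes
# set_providers(); TensorRT is listed first so it is preferred over CUDA.
session.set_providers(
    ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"])

ort_in = {session.get_inputs()[0].name: image}
while True:
    session.run(None, ort_in)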

We profiled trtserver with perf and generated its flame graph.

Flame graph file is:
trtis-cpu.zip

It shows that libgomp.so.1.0.0 consumes a lot of CPU time. Our GCC version is 7.1.0.

Next, we will build TRTIS in debug mode and profile it again.


GuanLuo commented Feb 4, 2020

@deadeyegoodwin I have a Dockerfile that builds ONNX Runtime and a sample executable that loads and runs a model in ONNX Runtime directly, but it is out of date (pre ONNX Runtime v1.0.0)... I can revisit it and post it here later this week.


lxl910915 commented Feb 5, 2020

@GuanLuo
We tested the following code:

  // Create the ORT environment and session options.
  const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
  OrtEnv* env;
  g_ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "test", &env);

  OrtSessionOptions* session_options;
  g_ort->CreateSessionOptions(&session_options);
  g_ort->SetIntraOpNumThreads(session_options, 1);
  g_ort->SetSessionGraphOptimizationLevel(session_options, ORT_ENABLE_BASIC);

  // Register execution providers: CUDA first, then TensorRT.
  OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);
  OrtSessionOptionsAppendExecutionProvider_Tensorrt(session_options);

  // Load the model and run it in a loop. Input/output name and tensor setup
  // is omitted here; RETURN_IF_ORT_ERROR is our error-checking macro.
  OrtSession* session;
  const char* model_path = "/tmp/east_model.onnx";
  printf("Using Onnxruntime C API\n");
  RETURN_IF_ORT_ERROR(g_ort->CreateSession(env, model_path, session_options, &session));
  printf("Created Session\n");
  while (true) {
    printf("Run Session\n");
    RETURN_IF_ORT_ERROR(g_ort->Run(
      session, NULL, input_names.data(),
      (const OrtValue* const*)input_tensors_.data(), input_tensors_.size(),
      output_names.data(), output_names.size(), output_tensors_.data()));
  }

It works well.


GuanLuo commented Feb 5, 2020

@lxl910915 What is the ONNX Runtime version you are using? The above code is similar to how we invoke the ORT APIs, except that for now we always disable optimization in SetSessionGraphOptimizationLevel.

By the way, the order in which you call OrtSessionOptionsAppendExecutionProvider_XXX affects the priority with which ops are assigned to the execution providers. Your example prioritizes CUDA over TensorRT, but TRTIS will prioritize TensorRT over CUDA if the TensorRT EP is specified.
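
To illustrate that ordering, here is a minimal sketch (not TRTIS's actual code) that reuses the two registration calls from your snippet, with TensorRT registered first so it gets first pick of ops:

  // Sketch only: register TensorRT before CUDA so that TensorRT is preferred
  // when ops are assigned; CUDA handles whatever TensorRT does not claim.
  // Swapping the two calls gives the CUDA-first ordering from your snippet.
  OrtSessionOptionsAppendExecutionProvider_Tensorrt(session_options);
  OrtSessionOptionsAppendExecutionProvider_CUDA(session_options, 0);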


lxl910915 commented Feb 6, 2020

@GuanLuo Thanks for your reply.

  1. Our ONNX Runtime version is 1.1.0.

  2. We also tried both the CUDA->TensorRT and TensorRT->CUDA orderings, and both work well.

  3. If we prioritize CUDA over TensorRT in TRTIS, TRTIS core dumps.

  4. Finally, if we disable optimization in SetSessionGraphOptimizationLevel, it consumes a lot of CPUs. So we set ORT_ENABLE_BASIC in SetSessionGraphOptimizationLevel, and TRTIS only uses 1 CPU, but the inference speed decreases by about 10%.

What's more, if we set ORT_ENABLE_BASIC in SetSessionGraphOptimizationLevel and then enable dynamic batching, the problem is still there. If we add --fold_const to tf2onnx.convert, trtserver uses only 1 CPU.


GuanLuo commented Feb 7, 2020

I assume you achieved 3. by changing the source code? Otherwise, I think TRTIS always prioritizes other GPU accelerators over CUDA. If so, can you share the code change? It is strange that changing the order causes an exception on the TRTIS side.

Are you building TRTIS from the 20.01 branch or from master? Master has now rolled forward to ONNX Runtime 1.1.0. If you already built from master, then we should investigate further...


lxl910915 commented Feb 7, 2020

@GuanLuo In the onnx_backend.cc file, we changed

  if (gpu_device != Context::NO_GPU_DEVICE) {
#ifdef TRTIS_ENABLE_GPU
    if (Config().optimization().has_execution_accelerators()) {
      ...
    }
    RETURN_IF_ORT_ERROR(OrtSessionOptionsAppendExecutionProvider_CUDA(
        session_options, gpu_device));
    LOG_VERBOSE(1) << "CUDA Execution Accelerator is set for " << instance_name
                   << " on device " << gpu_device;
#else
    return Status(RequestStatusCode::INTERNAL, "GPU instances not supported");
#endif  // TRTIS_ENABLE_GPU
  }

to

  if (gpu_device != Context::NO_GPU_DEVICE) {
#ifdef TRTIS_ENABLE_GPU
    RETURN_IF_ORT_ERROR(OrtSessionOptionsAppendExecutionProvider_CUDA(
        session_options, gpu_device));
    LOG_VERBOSE(1) << "CUDA Execution Accelerator is set for " << instance_name
                   << " on device " << gpu_device;

    if (Config().optimization().has_execution_accelerators()) {
      ...
    }
#else
    return Status(RequestStatusCode::INTERNAL, "GPU instances not supported");
#endif  // TRTIS_ENABLE_GPU
  }

We built the master branch, and the problem still exists.

@lxl910915
Author

This is the same problem as onnx/tensorflow-onnx#784 (comment).


GuanLuo commented Feb 20, 2020

So the root cause of the issue is in ONNX Runtime?

@lxl910915
Author

> So the root cause of the issue is in ONNX Runtime?

It seems so.
