object detect api fasterrcnn OOM #3697

Closed
@shartoo

Description

Here is my basic information

system summary

  • windows: 10/16GB RAM
  • GPU : GeForce GTX 1080 Ti
  • cuda : 8.0
  • cudnn: 7.1
  • tensorflow-gpu: 1.5
  • python: 3.5

data summary

  • number of samples: 10000 (70% training, 30% validation)
  • size of train.records: 744MB
  • size of val.records: 334MB
  • image (jpg)
    • shape: 640 × 480 (resized by opencv)
    • size: 35kb to 1275kb

configuration of training

I'm using faster_rcnn_inception_resnet_v2.config, copied from the object detection sample config files. Here are the details:

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.6
    # modify from 300 to  600
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.7
        iou_threshold: 0.3
        max_detections_per_class: 10
        max_total_detections: 40
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 4
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: ""
  from_detection_checkpoint: true
  num_steps: 100000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: ""
  }
  label_map_path: ""
}

eval_config: {
  num_examples: 1000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: ""
  }
  label_map_path: ""
  shuffle: false
  num_readers: 1
}

As you can see, batch_size is 4 (decreasing it from 64 to 4 made no difference).

error log

Here is the error log

INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From D:\workspace\compet\ipcr3\object_detection\trainer.py:176: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From D:\workspace\compet\ipcr3\object_detection\builders\optimizer_builder.py:105: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
C:\Python35\lib\site-packages\tensorflow\python\ops\gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-03-22 15:03:47.552158: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-03-22 15:03:47.890510: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2018-03-22 15:03:47.890825: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path D:/workspace/compet/ipcr3/data/ICPR3part1/tf_ckpt\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
2018-03-22 15:05:22.078875: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 124.51MiB.  Current allocation summary follows.
2018-03-22 15:05:22.079202: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:627] Bin (256): 	Total Chunks: 730, Chunks in use: 666. 182.5KiB allocated for chunks. 166.5KiB in use in bin. 36.1KiB client-requested in use in bin.
......
2018-03-22 15:05:23.036772: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:683] Sum Total of in-use chunks: 8.52GiB
2018-03-22 15:05:23.036933: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:685] Stats: 
Limit:                  9280555582
InUse:                  9147851008
MaxInUse:               9226589952
NumAllocs:                    5341
MaxAllocSize:           1278345216

2018-03-22 15:05:23.037782: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:277] ****************************************************************************************************
2018-03-22 15:05:23.038097: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,38,50,384]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[4,75,100,1088]
	 [[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_8/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_7/Relu, FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_8/mul)]]
	 [[Node: gradients/FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_16/Branch_0/Conv2d_1x1/BatchNorm/FusedBatchNorm_grad/tuple/control_dependency_2/_5985 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_21725_gradients/FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_16/Branch_0/Conv2d_1x1/BatchNorm/FusedBatchNorm_grad/tuple/control_dependency_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_8/add', defined at:

other information

I can train on other datasets with the same configuration without problems, but training fails on the dataset described above no matter which parameters I change. Can somebody help me out? Thank you!
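
As a rough cross-check of the log above (a sketch added for illustration, not part of the original report): the failed 124.51MiB allocation matches a single float32 tensor of shape [4,75,100,1088], and 75 × 100 is what a 480 × 640 image becomes once it is resized to 600 × 800 and divided by the feature stride of 8. The helper names below are hypothetical, and the resize rule is an approximation of keep_aspect_ratio_resizer, not the API's actual code.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor; 4 bytes per element assumes float32 (DT_FLOAT)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 2**20


def resized_hw(height, width, min_dimension=600, max_dimension=1024):
    """Approximate keep_aspect_ratio_resizer output size (hypothetical helper).

    Scale the short side to min_dimension unless that would push the long
    side past max_dimension; in that case scale the long side instead.
    """
    scale = min_dimension / min(height, width)
    if round(max(height, width) * scale) > max_dimension:
        scale = max_dimension / max(height, width)
    return round(height * scale), round(width * scale)


# Values taken from the config and the log in this report.
stride = 8                      # first_stage_features_stride
h, w = resized_hw(480, 640)     # image shape from the data summary
feature_map = (h // stride, w // stride)

print((h, w), feature_map)                           # (600, 800) (75, 100)
print(round(tensor_mib([4, 75, 100, 1088]), 2))      # 124.51, batch_size 4
print(round(tensor_mib([1, 75, 100, 1088]), 2))      # same tensor at batch_size 1

# Headroom left when the allocation failed, from the "Stats" lines.
limit, in_use = 9280555582, 9147851008
print(round((limit - in_use) / 2**20, 2), "MiB free under the limit")
```

This accounts for one activation only. Training keeps many such tensors plus their gradients, so lowering batch_size shrinks each of them proportionally but may still not be enough for this feature extractor at this resolution.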

Activity

TheFlashover commented on Mar 23, 2018

Try to change

train_config: {
  batch_size: 4

to

train_config: {
  batch_size: 1

in faster_rcnn_inception_resnet_v2.config

shartoo commented on Mar 24, 2018

Author

@TheFlashover Thank you for your advice, but the error remains the same.

kirk86 commented on Mar 27, 2018

@TheFlashover @shartoo What I've found really strange is that, for me, all faster-rcnn configs work fine with batch_size=1, but as soon as I change that to 2 everything breaks down:

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,500,625,3] vs. shape[1] = [1,500,500,3]
         [[Node: concat_1 = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, Preprocessor_1/sub, gradients/Gather_grad/concat/axis)]]
         [[Node: gradients/FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block1/unit_1/bottleneck_v1/shortcut/BatchNorm/FusedBatchNorm_grad/FusedBatchNormGrad/_4233 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13772...chNormGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

I can't understand why that's happening, and the documentation is scarce in this case.

HelloWorldzyy commented on Mar 28, 2018

@kirk86 Hi, I've run into the same problem. Have you solved it? Could you share a solution?

kirk86 commented on Mar 28, 2018

@HelloWorldzyy I don't understand if you're trying to mock me or what? First you down vote a legitimate problem and then you're asking for solution. I can't really understand what you're trying to achieve?

kellenf commented on Jul 24, 2018

@kirk86 Thanks! Your comment solved my problem, but what you said is not entirely accurate: when my dataset has only one object to detect, the batch size can be 8, 16, and so on.
However, when my dataset has 7 objects, the batch size has to be 1!
I don't know why, but it must be caused by some detail of Faster R-CNN. I will read the paper again carefully. If you can give me the answer, I would appreciate it very much!

mawanda-jun commented on Aug 19, 2018

Hi,
I don't know if you've solved your problem yet, but the cause is that your training set contains images of different sizes. The network needs every image in a batch to have the same dimensions, so the error arises.
I suggest changing your resizer from
keep_aspect_ratio_resizer { min_dimension: 600 max_dimension: 1024 }
to
fixed_shape_resizer { width: 600 height: 800 }
or whatever size you want. It depends on how much RAM and computational power you have.
I'm not an expert, but it worked for me.
Remember also that a bigger batch size generally gives better results, but only as long as all your batches fit in RAM (or GPU memory). If your PC starts swapping, consider reducing the resizer dimensions or the batch size, since swapping will slow the computation down by hurting throughput.

Hope this helps. If somebody more skilled than me says otherwise, listen to them!
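
To illustrate the explanation above for the ConcatOp error quoted earlier in the thread, here is a minimal sketch assuming the resizer scales the short side to min_dimension and caps the long side at max_dimension. The function names and the two example image sizes are hypothetical; this approximates the behaviour and is not the API's code.

```python
def keep_aspect_ratio_size(height, width, min_dimension=600, max_dimension=1024):
    """Approximate output (height, width) of keep_aspect_ratio_resizer."""
    scale = min_dimension / min(height, width)
    if round(max(height, width) * scale) > max_dimension:
        scale = max_dimension / max(height, width)
    return (round(height * scale), round(width * scale))


def can_batch(shapes):
    """Images can be stacked into one batch tensor only if all shapes agree."""
    return len(set(shapes)) <= 1


# Two hypothetical source images with different aspect ratios.
sources = [(480, 640), (500, 500)]

resized = [keep_aspect_ratio_size(h, w) for h, w in sources]
print(resized, can_batch(resized))   # [(600, 800), (600, 600)] False

# fixed_shape_resizer { width: 600 height: 800 } maps every image to one shape.
fixed = [(800, 600) for _ in sources]
print(fixed, can_batch(fixed))       # [(800, 600), (800, 600)] True
```

With batch_size 1 there is nothing to stack, which would explain why the configs run at 1 and break at 2 when the images differ in aspect ratio.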

dshahrokhian commented on Oct 1, 2018

I want to add an additional option to the ones mentioned above. As a summary, there are 3 possible solutions:

  1. Add pad_to_max_dimension : true in keep_aspect_ratio_resizer:

keep_aspect_ratio_resizer {
  pad_to_max_dimension : true
}

  2. Change batch size to 1:

train_config: {
  batch_size: 1
}

  3. Use fixed_shape_resizer instead of keep_aspect_ratio_resizer:

fixed_shape_resizer { width: <pixels> height: <pixels> }
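
The first option can be sketched as follows, assuming pad_to_max_dimension zero-pads each resized image to max_dimension × max_dimension. The helper name and the example shapes are hypothetical, and the pixel count is only a rough proxy for memory use.

```python
def pad_to_max_dimension(shape_hw, max_dimension=1024):
    """Return the padded (height, width); assumes padding to a square canvas."""
    height, width = shape_hw
    if height > max_dimension or width > max_dimension:
        raise ValueError("image is larger than max_dimension")
    return (max_dimension, max_dimension)


# Hypothetical shapes after keep_aspect_ratio_resizer (min 600, max 1024).
resized = [(600, 800), (600, 600)]

padded = [pad_to_max_dimension(shape) for shape in resized]
print(padded)                      # [(1024, 1024), (1024, 1024)]
print(len(set(padded)) == 1)       # True: the batch can now be stacked

# Cost of padding: pixels per image relative to an unpadded 600 x 800 image.
pixel_ratio = (1024 * 1024) / (600 * 800)
print(round(pixel_ratio, 2))       # 2.18
```

Padding makes batching possible, but each image then carries roughly twice the pixels of a 600 × 800 one. If memory is already tight, as in the original OOM report, a smaller max_dimension or a fixed_shape_resizer may be the safer choice.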

pedramtehranchi commented on Aug 3, 2019

same problem

tensorflowbutler commented on Jan 29, 2020

Member

Hi there,
We are checking to see if you still need help with this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help with this issue any more, please consider closing it.

          object detect api fasterrcnn OOM · Issue #3697 · tensorflow/models