object detect api fasterrcnn OOM #3697

Closed
shartoo opened this issue Mar 22, 2018 · 10 comments

shartoo commented Mar 22, 2018

Here is my basic information

system summary

  • Windows: 10 (16 GB RAM)
  • GPU : GeForce GTX 1080 Ti
  • cuda : 8.0
  • cudnn: 7.1
  • tensorflow-gpu: 1.5
  • python: 3.5

data summary

  • number of samples: 10000 (split 70% training / 30% validation)
  • size of train.records: 744MB
  • size of val.records: 334MB
  • image (jpg)
    • shape: 640×480 (resized with OpenCV; see the sketch after this list)
    • size: 35kb to 1275kb
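A minimal sketch of that OpenCV resize step (the file names are placeholders, not from the original report):

import cv2

img = cv2.imread("sample.jpg")          # loads as a BGR array of shape (H, W, 3)
resized = cv2.resize(img, (640, 480))   # note: cv2.resize takes (width, height)
cv2.imwrite("sample_resized.jpg", resized)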

configuration of training

I'm using faster_rcnn_inception_resnet_v2.config copied from the object detection sample config files. Here are the details:

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.6
    # modify from 300 to  600
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.7
        iou_threshold: 0.3
        max_detections_per_class: 10
        max_total_detections: 40
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 4
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: ""
  from_detection_checkpoint: true
  num_steps: 100000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: ""
  }
  label_map_path: ""
}

eval_config: {
  num_examples: 1000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: ""
  }
  label_map_path: ""
  shuffle: false
  num_readers: 1
}

As you can see, batch_size = 4 (decreasing it from 64 to 4 made no difference).

error log

Here is the error log

INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From D:\workspace\compet\ipcr3\object_detection\trainer.py:176: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From D:\workspace\compet\ipcr3\object_detection\builders\optimizer_builder.py:105: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
C:\Python35\lib\site-packages\tensorflow\python\ops\gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-03-22 15:03:47.552158: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-03-22 15:03:47.890510: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2018-03-22 15:03:47.890825: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path D:/workspace/compet/ipcr3/data/ICPR3part1/tf_ckpt\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
2018-03-22 15:05:22.078875: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 124.51MiB.  Current allocation summary follows.
2018-03-22 15:05:22.079202: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:627] Bin (256): 	Total Chunks: 730, Chunks in use: 666. 182.5KiB allocated for chunks. 166.5KiB in use in bin. 36.1KiB client-requested in use in bin.
......
2018-03-22 15:05:23.036772: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:683] Sum Total of in-use chunks: 8.52GiB
2018-03-22 15:05:23.036933: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:685] Stats: 
Limit:                  9280555582
InUse:                  9147851008
MaxInUse:               9226589952
NumAllocs:                    5341
MaxAllocSize:           1278345216

2018-03-22 15:05:23.037782: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:277] ****************************************************************************************************
2018-03-22 15:05:23.038097: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,38,50,384]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[4,75,100,1088]
	 [[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_8/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_7/Relu, FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_8/mul)]]
	 [[Node: gradients/FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_16/Branch_0/Conv2d_1x1/BatchNorm/FusedBatchNorm_grad/tuple/control_dependency_2/_5985 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_21725_gradients/FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_16/Branch_0/Conv2d_1x1/BatchNorm/FusedBatchNorm_grad/tuple/control_dependency_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_8/add', defined at:

other information

I can train on other datasets with this same configuration without problems, but training fails on the dataset described above no matter what parameter changes I make. Can somebody help me out? Thank you!
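One general TF 1.x mitigation for OOM errors like the one above is to stop TensorFlow from reserving nearly all GPU memory up front. A minimal sketch, assuming you drive the session yourself (the Object Detection API's train.py builds its own session config, so this would need to be wired in there); it only helps if the model itself actually fits:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving ~all of it at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop ...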


TheFlashover commented Mar 23, 2018

Try to change

train_config: {
  batch_size: 4

to

train_config: {
  batch_size: 1

in faster_rcnn_inception_resnet_v2.config


shartoo commented Mar 24, 2018

@TheFlashover Thank you for your advice, but the error remains the same.


kirk86 commented Mar 27, 2018

@TheFlashover @shartoo What I've found really strange is that, for me, all Faster R-CNN configs work fine with batch_size=1, but as soon as I change that to 2, everything breaks down:

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,500,625,3] vs. shape[1] = [1,500,500,3]
         [[Node: concat_1 = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Preprocessor/sub, Preprocessor_1/sub, gradients/Gather_grad/concat/axis)]]
         [[Node: gradients/FirstStageFeatureExtractor/resnet_v1_101/resnet_v1_101/block1/unit_1/bottleneck_v1/shortcut/BatchNorm/FusedBatchNorm_grad/FusedBatchNormGrad/_4233 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_13772...chNormGrad", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

I can't understand why that's happening, and the documentation is scarce on this point.
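For what it's worth, the shapes in that ConcatOp error suggest the cause: keep_aspect_ratio_resizer preserves each image's aspect ratio, so two images with different aspect ratios come out at different sizes and cannot be stacked into one batch tensor. The helper below is only a sketch of the resizer's documented scaling rule (exact rounding may differ):

def keep_aspect_ratio_shape(h, w, min_dim=600, max_dim=1024):
    # Scale so the short side reaches min_dim, then clamp so the long
    # side does not exceed max_dim.
    scale = min_dim / min(h, w)
    if max(h, w) * scale > max_dim:
        scale = max_dim / max(h, w)
    return round(h * scale), round(w * scale)

print(keep_aspect_ratio_shape(480, 600))  # (600, 750)
print(keep_aspect_ratio_shape(480, 480))  # (600, 600) -- a different shape,
                                          # so a batch of the two cannot be built

With batch_size=1 no stacking ever happens, which would explain why those configs work.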

@HelloWorldzyy

@kirk86 Hi, I've hit the same problem. Did you solve it? Could you give me a solution?


kirk86 commented Mar 28, 2018

@HelloWorldzyy I don't understand, are you trying to mock me? First you downvote a legitimate problem, and then you ask for a solution. I can't really understand what you're trying to achieve.


kellenf commented Jul 24, 2018

@kirk86 Thanks! Your comment solved my problem, but what you said is not entirely accurate: when my dataset has only one object class to detect, the batch size can be 8, 16, and so on.
However, when my dataset has 7 object classes, the batch size can only be 1!
I don't know why, but it must be caused by some detail of Faster R-CNN; I will read the paper again carefully. If you can give me the answer, I will appreciate it very much!

@mawanda-jun

Hi,
I don't know if you've solved your problem yet, but the issue is that your training set is made up of images of different sizes. Since the networks inside the algorithm need input matrices of equal dimensions, the error arises.
I suggest you change your resizer from:
keep_aspect_ratio_resizer { min_dimension: 600 max_dimension: 1024 }
to:
fixed_shape_resizer { width: 600 height: 800 }
or whatever dimensions you want; it depends on how much RAM and computational power you have.
I'm not an expert, but this worked for me.
Remember also that, in general, the bigger the batch size, the better the results, but this holds only as long as all your batches fit in RAM (or GPU memory). If your PC starts swapping, you should reduce the resizer shape or the batch size, since otherwise you'll end up slowing down your computation due to throughput problems.

Hope this helps; if somebody is more skilled than me, listen to them!
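As a concrete sketch, that suggestion would slot into the image_resizer block of the original config like this (600×800 are just the values proposed above; tune them to your memory budget):

model {
  faster_rcnn {
    image_resizer {
      fixed_shape_resizer {
        # every image is resized to exactly this shape, so a batch always
        # stacks tensors of identical dimensions, whatever the batch_size
        width: 600
        height: 800
      }
    }
    # ... rest of the model config unchanged ...
  }
}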

@dshahrokhian

I want to add an additional option to the ones mentioned above. As a summary, there are 3 possible solutions:

  1. Add pad_to_max_dimension: true in keep_aspect_ratio_resizer:
keep_aspect_ratio_resizer {
  pad_to_max_dimension: true
}
  2. Change batch size to 1:
train_config: {
  batch_size: 1
}
  3. Use fixed_shape_resizer instead of keep_aspect_ratio_resizer:
fixed_shape_resizer { width: <pixels> height: <pixels> }
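Option 1 written out in full, keeping the dimensions from the original config (padding every resized image up to the maximum size is what gives every batch element the same static shape):

image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
    # pad each resized image to max_dimension x max_dimension so all
    # images in a batch share one shape
    pad_to_max_dimension: true
  }
}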

@pedramtehranchi

Same problem.

@tensorflowbutler
Member

Hi there,
We are checking to see if you still need help on this, as it seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.
