Closed
Description
Here is my basic information
system summary
- windows: 10/16GB RAM
- GPU : GeForce GTX 1080 Ti
- cuda : 8.0
- cudnn: 7.1
- tensorflow-gpu: 1.5
- python: 3.5
data summary
- number of samples: 10000 (70% training, 30% validation)
- size of train.records: 744MB
- size of val.records: 334MB
- image (jpg)
- shape:
(resized by opencv) - size: 35kb to 1275kb
- shape:
configuration of training
I'm using faster_rcnn_inception_resnet_v2.config,
copied from the object detection sample config files. Here are the details:
model {
faster_rcnn {
num_classes: 1
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 600
max_dimension: 1024
}
}
feature_extractor {
type: 'faster_rcnn_inception_resnet_v2'
first_stage_features_stride: 8
}
first_stage_anchor_generator {
grid_anchor_generator {
scales: [0.25, 0.5, 1.0, 2.0]
aspect_ratios: [0.5, 1.0, 2.0]
height_stride: 8
width_stride: 8
}
}
first_stage_atrous_rate: 2
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
truncated_normal_initializer {
stddev: 0.01
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.6
# modify from 300 to 600
first_stage_max_proposals: 300
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 17
maxpool_kernel_size: 1
maxpool_stride: 1
second_stage_box_predictor {
mask_rcnn_box_predictor {
use_dropout: false
dropout_keep_probability: 1.0
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.7
iou_threshold: 0.3
max_detections_per_class: 10
max_total_detections: 40
}
score_converter: SOFTMAX
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config: {
batch_size: 4
optimizer {
momentum_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: 0.0003
schedule {
step: 0
learning_rate: .0003
}
schedule {
step: 900000
learning_rate: .00003
}
schedule {
step: 1200000
learning_rate: .000003
}
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
gradient_clipping_by_norm: 10.0
fine_tune_checkpoint: ""
from_detection_checkpoint: true
num_steps: 100000
data_augmentation_options {
random_horizontal_flip {
}
}
}
train_input_reader: {
tf_record_input_reader {
input_path: ""
}
label_map_path: ""
}
eval_config: {
num_examples: 1000
# Note: The below line limits the evaluation process to 10 evaluations.
# Remove the below line to evaluate indefinitely.
max_evals: 10
}
eval_input_reader: {
tf_record_input_reader {
input_path: ""
}
label_map_path: ""
shuffle: false
num_readers: 1
}
As you can see, batch_size=4
(I decreased it from 64 to 4, but it made no difference).
error log
Here is the error log
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From D:\workspace\compet\ipcr3\object_detection\trainer.py:176: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
WARNING:tensorflow:From D:\workspace\compet\ipcr3\object_detection\builders\optimizer_builder.py:105: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
C:\Python35\lib\site-packages\tensorflow\python\ops\gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-03-22 15:03:47.552158: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-03-22 15:03:47.890510: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.10GiB
2018-03-22 15:03:47.890825: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path D:/workspace/compet/ipcr3/data/ICPR3part1/tf_ckpt\model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
2018-03-22 15:05:22.078875: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 124.51MiB. Current allocation summary follows.
2018-03-22 15:05:22.079202: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:627] Bin (256): Total Chunks: 730, Chunks in use: 666. 182.5KiB allocated for chunks. 166.5KiB in use in bin. 36.1KiB client-requested in use in bin.
......
2018-03-22 15:05:23.036772: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:683] Sum Total of in-use chunks: 8.52GiB
2018-03-22 15:05:23.036933: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:685] Stats:
Limit: 9280555582
InUse: 9147851008
MaxInUse: 9226589952
NumAllocs: 5341
MaxAllocSize: 1278345216
2018-03-22 15:05:23.037782: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\bfc_allocator.cc:277] ****************************************************************************************************
2018-03-22 15:05:23.038097: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\framework\op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[16,38,50,384]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.ResourceExhaustedError'>, OOM when allocating tensor with shape[4,75,100,1088]
[[Node: FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_8/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_7/Relu, FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_8/mul)]]
[[Node: gradients/FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_16/Branch_0/Conv2d_1x1/BatchNorm/FusedBatchNorm_grad/tuple/control_dependency_2/_5985 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_21725_gradients/FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_16/Branch_0/Conv2d_1x1/BatchNorm/FusedBatchNorm_grad/tuple/control_dependency_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'FirstStageFeatureExtractor/InceptionResnetV2/InceptionResnetV2/Repeat_1/block17_8/add', defined at:
other information
I can train on other datasets with the same configuration without problems, but training fails on the dataset described above no matter which parameters I change. Can somebody help me out? Thank you!
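A rough sketch of the arithmetic behind the OOM message in the log above, assuming the failed tensor is float32 (the log reports T=DT_FLOAT, i.e. 4 bytes per element). The helper name tensor_mib is hypothetical; the shapes and allocator numbers are copied from the log. The batch dimension of 4 matches batch_size: 4, and 75 x 100 at a feature stride of 8 would correspond to a 600 x 800 input, although that last point is an inference rather than something the log states.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor; 4 bytes per element assumes float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 2**20


# Shape and allocator stats copied from the error log above.
failed_shape = [4, 75, 100, 1088]
limit_bytes = 9280555582
in_use_bytes = 9147851008

print(f"failed tensor: {tensor_mib(failed_shape):.2f} MiB")  # 124.51 MiB, as in the log
print(f"in use:        {in_use_bytes / 2**30:.2f} GiB")
print(f"headroom:      {(limit_bytes - in_use_bytes) / 2**20:.2f} MiB")
```

The allocator was only about 127 MiB below its limit when this single activation asked for 124.51 MiB, so the failure reflects the total activation footprint of the network at this input size rather than one unusually large tensor.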
Activity
TheFlashover commented on Mar 23, 2018
Try to change
to
in
faster_rcnn_inception_resnet_v2.config
shartoo commented on Mar 24, 2018
@TheFlashover Thank you for your advice, but the error remains the same.
kirk86 commented on Mar 27, 2018
@TheFlashover @shartoo What I find really strange is that, for me, all faster-rcnn configs work fine with batch_size=1, and as soon as I change that to 2, everything breaks down. I can't understand why that happens, and the documentation is scarce in this case.
HelloWorldzyy commented on Mar 28, 2018
@kirk86 Hi, I have the same problem. Did you solve it? Could you suggest a solution?
kirk86 commented on Mar 28, 2018
@HelloWorldzyy I don't understand if you're trying to mock me or what? First you down vote a legitimate problem and then you're asking for solution. I can't really understand what you're trying to achieve?
kellenf commented on Jul 24, 2018
@kirk86 Thanks! Your comment solved my problem, but what you said is not entirely accurate: when my dataset has only one object to detect, the batch size can be 8, 16, and so on.
However, when my dataset has 7 objects, the batch size has to be 1!
I don't know why, but it must be caused by some detail of Faster R-CNN. I will read the paper again carefully. If you can give me the answer, I will appreciate it very much!
mawanda-jun commented on Aug 19, 2018
Hi,
I don't know whether you have solved your problem yet, but the cause is that your training set contains images of different sizes. The neural networks inside the algorithm need all matrices in a batch to have equal dimensions, so the error arises.
I suggest changing your resizer from
keep_aspect_ratio_resizer { min_dimension: 600 max_dimension: 1024 }
to:
fixed_shape_resizer { width: 600 height: 800 }
or whatever size you want. It depends on how much RAM and computational power you have.
I'm not much of an expert, but it worked for me.
Remember also that, in general, the bigger the batch size, the better the results; however, this holds only as long as all your batches fit in RAM (or GPU memory). If your PC starts swapping, you should consider reducing the resizer shape or the batch size, since otherwise throughput problems will slow down your computation.
Hope this helps. However, if somebody is more skilled than me, listen to them!
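To illustrate why images of different sizes end up with different tensor shapes under keep_aspect_ratio_resizer, here is a simplified sketch of the resizing rule. It assumes the resizer scales the shorter side to min_dimension unless that would push the longer side past max_dimension; the function name keep_aspect_ratio_shape and the sample image sizes are hypothetical, and the real Object Detection API implementation may round differently.

```python
def keep_aspect_ratio_shape(height, width, min_dimension=600, max_dimension=1024):
    """Approximate output (height, width) of an aspect-preserving resize."""
    # First try to bring the shorter side to min_dimension.
    scale = min_dimension / min(height, width)
    # If the longer side would overshoot, cap it at max_dimension instead.
    if max(height, width) * scale > max_dimension:
        scale = max_dimension / max(height, width)
    return round(height * scale), round(width * scale)


# Hypothetical input sizes; each ends up with a different output shape.
for h, w in [(480, 640), (1080, 1920), (1000, 1000)]:
    print((h, w), "->", keep_aspect_ratio_shape(h, w))
```

With these inputs the outputs are (600, 800), (576, 1024) and (600, 600), which cannot be stacked into one batch tensor without padding. A fixed_shape_resizer avoids this by giving every image the same height and width, at the cost of distorting the aspect ratio.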
dshahrokhian commented on Oct 1, 2018
I want to add an additional option to the ones mentioned above. As a summary, there are 3 possible solutions:
- Set pad_to_max_dimension: true in keep_aspect_ratio_resizer
- Set batch_size: 1
- Use fixed_shape_resizer instead of keep_aspect_ratio_resizer
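A minimal sketch of what the padding option amounts to, assuming pad_to_max_dimension zero-pads each resized image at the bottom and right up to max_dimension x max_dimension. The helper pad_to_square is hypothetical and works on a plain 2-D list of pixel values rather than a real image tensor.

```python
def pad_to_square(image, max_dimension):
    """Zero-pad a 2-D list of pixel rows to max_dimension x max_dimension."""
    height, width = len(image), len(image[0])
    if height > max_dimension or width > max_dimension:
        raise ValueError("image is larger than max_dimension")
    # Extend every existing row on the right, then append all-zero rows below.
    padded = [row + [0] * (max_dimension - width) for row in image]
    padded += [[0] * max_dimension for _ in range(max_dimension - height)]
    return padded


# Two tiny "images" of different shapes become the same 4 x 4 shape.
a = pad_to_square([[1, 2, 3], [4, 5, 6]], 4)
b = pad_to_square([[7], [8], [9]], 4)
print(len(a), len(a[0]), len(b), len(b[0]))
```

Once every image has the same shape, a batch larger than 1 can be stacked. The trade-off is memory: with max_dimension: 1024, each padded image has 1024 x 1024 pixels, about 2.2 times as many as a 600 x 800 image, so padding alone may not resolve an out-of-memory error.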
pedramtehranchi commented on Aug 3, 2019
same problem
tensorflowbutler commented on Jan 29, 2020
Hi There,
We are checking to see if you still need help on this, as this seems to be considerably old issue. Please update this issue with the latest information, code snippet to reproduce your issue and error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing this.