
Set training=True in BatchNormalization layer causes evaluation error in custom Estimator model #16455

Closed

@CasiaFan

Description

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): pip install
  • TensorFlow version (use command below): 1.4.1
  • Python version: 3.5.2
  • CUDA/cuDNN version: 8.0/6
  • GPU model and memory: GeForce 1080ti, 11G

Describe the problem

I defined a custom Estimator model for classification following this document. The CIFAR-10 dataset is used for testing, and the network is Xception rewritten in TensorFlow. But when using estimator.train_and_evaluate() to train and evaluate the model repeatedly, I find that evaluation accuracy doesn't improve, while training accuracy increases normally during training. Inspired by the official TensorFlow ResNet estimator example:

  with tf.variable_scope('conv_layer1'):
    net = tf.layers.conv2d(
        x,
        filters=64,
        kernel_size=7,
        activation=tf.nn.relu)
    net = tf.layers.batch_normalization(net)   # no training status, default is False

When I turn off training=is_training in tf.layers.batch_normalization() (so it defaults to False), both training and evaluation work normally. Since the Estimator's model_fn is called multiple times (see issue #13895), is this issue related to graph reuse in the BN layer, and can the training flag be set so that the BN layer acts differently during training versus evaluation/prediction?

BTW, the same issue occurs when using Keras or slim instead of tf.layers to construct the network architecture.
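
For context, the repeated train/evaluate loop was driven roughly like this (a sketch; the input_fn names, step counts, and HParams values are assumptions, not from the issue — only model_dir matches the logs below):

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir="train_episode5",
    params=tf.contrib.training.HParams(learning_rate=0.001))
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=20000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, steps=100)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)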

Source code / logs

Network architecture:

def tf_xception(features, input_shape, pooling=None, classes=2, is_training=True):
    # is_training = False  # manually set False to disable training option
    x = tf.layers.conv2d(features, 32, (3, 3), strides=(2, 2), use_bias=False, name='block1_conv1')
    x = tf.layers.batch_normalization(x, training=is_training, name='block1_conv1_bn')
    x = tf.nn.relu(x, name='block1_conv1_act')
    x = tf.layers.conv2d(x, 64, (3, 3), use_bias=False, name='block1_conv2')
    x = tf.layers.batch_normalization(x, training=is_training, name='block1_conv2_bn')
    x = tf.nn.relu(x, name='block1_conv2_act')

    residual = tf.layers.conv2d(x, 128, (1, 1), strides=(2, 2),
                      padding='same', use_bias=False)
    residual = tf.layers.batch_normalization(residual, training=is_training)

    x = tf.layers.separable_conv2d(x, 128, (3, 3), padding='same', use_bias=False, name='block2_sepconv1')
    x = tf.layers.batch_normalization(x, training=is_training, name='block2_sepconv1_bn')
    x = tf.nn.relu(x, name='block2_sepconv2_act')
    x = tf.layers.separable_conv2d(x, 128, (3, 3), padding='same', use_bias=False, name='block2_sepconv2')
    x = tf.layers.batch_normalization(x, training=is_training, name='block2_sepconv2_bn')

    x = tf.layers.max_pooling2d(x, (3, 3), strides=(2, 2), padding='same', name='block2_pool')
    x = tf.add(x, residual, name='block2_add')

    residual = tf.layers.conv2d(x, 256, (1, 1), strides=(2, 2),
                      padding='same', use_bias=False)
    residual = tf.layers.batch_normalization(residual, training=is_training)

    x = tf.nn.relu(x, name='block3_sepconv1_act')
    x = tf.layers.separable_conv2d(x, 256, (3, 3), padding='same', use_bias=False, name='block3_sepconv1')
    x = tf.layers.batch_normalization(x, training=is_training, name='block3_sepconv1_bn')
    x = tf.nn.relu(x, name='block3_sepconv2_act')
    x = tf.layers.separable_conv2d(x, 256, (3, 3), padding='same', use_bias=False, name='block3_sepconv2')
    x = tf.layers.batch_normalization(x, training=is_training, name='block3_sepconv2_bn')

    x = tf.layers.max_pooling2d(x, (3, 3), strides=(2, 2), padding='same', name='block3_pool')
    x = tf.add(x, residual, name="block3_add")

    residual = tf.layers.conv2d(x, 728, (1, 1), strides=(2, 2),
                      padding='same', use_bias=False)
    residual = tf.layers.batch_normalization(residual, training=is_training)

    x = tf.nn.relu(x, name='block4_sepconv1_act')
    x = tf.layers.separable_conv2d(x, 728, (3, 3), padding='same', use_bias=False, name='block4_sepconv1')
    x = tf.layers.batch_normalization(x, training=is_training, name='block4_sepconv1_bn')
    x = tf.nn.relu(x, name='block4_sepconv2_act')
    x = tf.layers.separable_conv2d(x, 728, (3, 3), padding='same', use_bias=False, name='block4_sepconv2')
    x = tf.layers.batch_normalization(x, training=is_training, name='block4_sepconv2_bn')

    x = tf.layers.max_pooling2d(x, (3, 3), strides=(2, 2), padding='same', name='block4_pool')
    x = tf.add(x, residual, name="block4_add")

    for i in range(8):
        residual = x
        prefix = 'block' + str(i + 5)

        x = tf.nn.relu(x, name=prefix + '_sepconv1_act')
        x = tf.layers.separable_conv2d(x, 728, (3, 3), padding='same', use_bias=False, name=prefix + '_sepconv1')
        x = tf.layers.batch_normalization(x, training=is_training, name=prefix + '_sepconv1_bn')
        x = tf.nn.relu(x, name=prefix + '_sepconv2_act')
        x = tf.layers.separable_conv2d(x, 728, (3, 3), padding='same', use_bias=False, name=prefix + '_sepconv2')
        x = tf.layers.batch_normalization(x, training=is_training, name=prefix + '_sepconv2_bn')
        x = tf.nn.relu(x, name=prefix + '_sepconv3_act')
        x = tf.layers.separable_conv2d(x, 728, (3, 3), padding='same', use_bias=False, name=prefix + '_sepconv3')
        x = tf.layers.batch_normalization(x, training=is_training, name=prefix + '_sepconv3_bn')

        x = tf.add(x, residual, name=prefix+"_add")

    residual = tf.layers.conv2d(x, 1024, (1, 1), strides=(2, 2),
                      padding='same', use_bias=False)
    residual = tf.layers.batch_normalization(residual, training=is_training)

    x = tf.nn.relu(x, name='block13_sepconv1_act')
    x = tf.layers.separable_conv2d(x, 728, (3, 3), padding='same', use_bias=False, name='block13_sepconv1')
    x = tf.layers.batch_normalization(x, training=is_training, name='block13_sepconv1_bn')
    x = tf.nn.relu(x, name='block13_sepconv2_act')
    x = tf.layers.separable_conv2d(x, 1024, (3, 3), padding='same', use_bias=False, name='block13_sepconv2')
    x = tf.layers.batch_normalization(x, training=is_training, name='block13_sepconv2_bn')

    x = tf.layers.max_pooling2d(x, (3, 3), strides=(2, 2), padding='same', name='block13_pool')
    x = tf.add(x, residual, name="block13_add")

    x = tf.layers.separable_conv2d(x, 1536, (3, 3), padding='same', use_bias=False, name='block14_sepconv1')
    x = tf.layers.batch_normalization(x, training=is_training, name='block14_sepconv1_bn')
    x = tf.nn.relu(x, name='block14_sepconv1_act')

    x = tf.layers.separable_conv2d(x, 2048, (3, 3), padding='same', use_bias=False, name='block14_sepconv2')
    x = tf.layers.batch_normalization(x, training=is_training, name='block14_sepconv2_bn')
    x = tf.nn.relu(x, name='block14_sepconv2_act')
    # replace conv layer with fc
    x = tf.layers.average_pooling2d(x, (3, 3), (2, 2), name="global_average_pooling")
    x = tf.layers.conv2d(x, 2048, [1, 1], activation=None, name="block15_conv1")
    x = tf.layers.conv2d(x, classes, [1, 1], activation=None, name="block15_conv2")
    x = tf.squeeze(x, axis=[1, 2], name="logits")
    return x

model_fn:

def model_fn(features, labels, mode, params):
    # check if training stage
    if mode == tf.estimator.ModeKeys.TRAIN:
        is_training = True
    else:
        is_training = False
    input_tensor = features["input"]
    logits = tf_xception(input_tensor, input_shape=(96, 96, 3), classes=10, is_training=is_training)
    probs = tf.nn.softmax(logits, name="output_score")
    predictions = tf.argmax(probs, axis=-1, name="output_label")
    onehot_labels = tf.one_hot(tf.cast(labels, tf.int32), 10)
    predictions_dict = {"score": probs,
                        "label": predictions}
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions_output = tf.estimator.export.PredictOutput(predictions_dict)
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions_dict,
                                          export_outputs={
                                              tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: predictions_output
                                          })
    # calculate loss
    loss = tf.losses.softmax_cross_entropy(onehot_labels, logits)
    accuracy = tf.metrics.accuracy(labels=labels,
                                   predictions=predictions)
    if mode == tf.estimator.ModeKeys.TRAIN:
        lr = params.learning_rate
        # train optimizer
        optimizer = tf.train.RMSPropOptimizer(learning_rate=lr, decay=0.9)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        tensors_to_log = {'batch_accuracy': accuracy[1]}
        logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=1000)
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train_op,
                                          training_hooks=[logging_hook])
    else:
        eval_metric_ops = {"accuracy": accuracy}
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          eval_metric_ops=eval_metric_ops)

If the training flag is set to True during training and False during evaluation:

train logs:

INFO:tensorflow:Saving checkpoints for 1 into train_episode5/model.ckpt.
INFO:tensorflow:loss = 2.3025837, step = 1
INFO:tensorflow:batch_accuracy = 0.109375
INFO:tensorflow:global_step/sec: 3.50983
INFO:tensorflow:loss = 2.3048878, step = 101 (28.492 sec)
INFO:tensorflow:Saving checkpoints for 185 into train_episode5/model.ckpt.
INFO:tensorflow:Loss for final step: 2.3093615.
...
INFO:tensorflow:Restoring parameters from train_episode5/model.ckpt-185
INFO:tensorflow:Saving checkpoints for 186 into train_episode5/model.ckpt.
INFO:tensorflow:loss = 2.2975698, step = 186
INFO:tensorflow:batch_accuracy = 0.09375
INFO:tensorflow:global_step/sec: 3.47248
INFO:tensorflow:loss = 2.3078504, step = 286 (28.798 sec)
INFO:tensorflow:Saving checkpoints for 374 into train_episode5/model.ckpt.
INFO:tensorflow:Loss for final step: 2.290754.
...
INFO:tensorflow:Restoring parameters from train_episode5/model.ckpt-374
INFO:tensorflow:Saving checkpoints for 375 into train_episode5/model.ckpt.
INFO:tensorflow:loss = 2.2987142, step = 375
INFO:tensorflow:batch_accuracy = 0.140625
INFO:tensorflow:global_step/sec: 3.50966
INFO:tensorflow:loss = 2.0407405, step = 475 (28.493 sec)
INFO:tensorflow:Saving checkpoints for 560 into train_episode5/model.ckpt.
INFO:tensorflow:Loss for final step: 2.1280906.
...
INFO:tensorflow:Restoring parameters from train_episode5/model.ckpt-560
INFO:tensorflow:Saving checkpoints for 561 into train_episode5/model.ckpt.
INFO:tensorflow:loss = 2.0747793, step = 561
INFO:tensorflow:batch_accuracy = 0.203125
INFO:tensorflow:global_step/sec: 3.31447
INFO:tensorflow:loss = 2.1767468, step = 661 (30.171 sec)
INFO:tensorflow:Saving checkpoints for 740 into train_episode5/model.ckpt.
INFO:tensorflow:Loss for final step: 1.9530052.
...
INFO:tensorflow:Restoring parameters from train_episode5/model.ckpt-740
INFO:tensorflow:Saving checkpoints for 741 into train_episode5/model.ckpt.
INFO:tensorflow:loss = 1.9676144, step = 741
INFO:tensorflow:batch_accuracy = 0.296875
INFO:tensorflow:global_step/sec: 3.50441
INFO:tensorflow:loss = 1.8766258, step = 841 (28.536 sec)
INFO:tensorflow:Saving checkpoints for 930 into train_episode5/model.ckpt.
INFO:tensorflow:Loss for final step: 1.884157.
...
INFO:tensorflow:Restoring parameters from train_episode5/model.ckpt-930
INFO:tensorflow:Saving checkpoints for 931 into train_episode5/model.ckpt.
INFO:tensorflow:loss = 1.8624167, step = 931
INFO:tensorflow:batch_accuracy = 0.296875
INFO:tensorflow:global_step/sec: 3.30778
INFO:tensorflow:loss = 1.7580669, step = 1031 (30.232 sec)
INFO:tensorflow:Saving checkpoints for 1112 into train_episode5/model.ckpt.
INFO:tensorflow:Loss for final step: 1.9509349.
...

eval log:

INFO:tensorflow:Saving dict for global step 170: accuracy = 0.099306434, global_step = 170, loss = 2.3029344
INFO:tensorflow:Saving dict for global step 348: accuracy = 0.11751261, global_step = 348, loss = 2.2920265
INFO:tensorflow:Saving dict for global step 528: accuracy = 0.13106872, global_step = 528, loss = 2.5031097
INFO:tensorflow:Saving dict for global step 697: accuracy = 0.085986756, global_step = 697, loss = 30.668789
INFO:tensorflow:Saving dict for global step 871: accuracy = 0.10009458, global_step = 871, loss = 47931.96
...

Accuracy stays around the initial 0.1 and the loss increases ridiculously!

If the training flag is set to False in both stages:

train log: similar to the train log above

eval log:

INFO:tensorflow:Saving dict for global step 185: accuracy = 0.1012768, global_step = 185, loss = 2.3037012
INFO:tensorflow:Saving dict for global step 374: accuracy = 0.10001576, global_step = 374, loss = 2.3124988
INFO:tensorflow:Saving dict for global step 560: accuracy = 0.20081967, global_step = 560, loss = 2.0881999
INFO:tensorflow:Saving dict for global step 740: accuracy = 0.26134932, global_step = 740, loss = 2.0297167
INFO:tensorflow:Saving dict for global step 930: accuracy = 0.26379257, global_step = 930, loss = 1.9529407
INFO:tensorflow:Saving dict for global step 1112: accuracy = 0.3454445, global_step = 1112, loss = 1.832811

Activity

CasiaFan (Author) commented on Jan 29, 2018

It seems that my BN layer usage was wrong. According to the TensorFlow tf.layers.batch_normalization() documentation:

Note: when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op.

The train_op should be created inside a tf.control_dependencies() scope:

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
  train_op = optimizer.minimize(loss)

Modifying the script above into the following snippet makes it work fine now:

...
optimizer = tf.train.RMSPropOptimizer(learning_rate=lr, decay=0.9)
# train optimizer
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
...

formigone (Contributor) commented on Mar 14, 2018

Could you post a working example of how to properly set the value for training in tf.layers.batch_normalization()?

As per the docs (TensorFlow 1.4):

training: Either a Python boolean, or a TensorFlow boolean scalar tensor (e.g. a placeholder). Whether to return the output in training mode (normalized with statistics of the current batch) or in inference mode (normalized with moving statistics). NOTE: make sure to set this parameter correctly, or else your training/inference will not work properly.
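
As an aside, the "TensorFlow boolean scalar tensor" variant mentioned in that quote could look roughly like this (a sketch; x and the feed logic are hypothetical, not from this thread):

is_training = tf.placeholder_with_default(False, shape=(), name="is_training")
x = tf.layers.batch_normalization(x, training=is_training)
# feed {is_training: True} on training session runs; the default covers inference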

This is how I'm setting it, but clearly it's not behaving as expected in EVAL and PREDICT modes:

def model_fn(features, labels, mode, params):
  training = bool(mode == tf.estimator.ModeKeys.TRAIN)
  extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

  # ...

  x = tf.layers.batch_normalization(x, training=training)

  with tf.control_dependencies(extra_update_ops):
    train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())

[Graph: training and evaluation loss curves, with and without batch normalization]

In the graph above, the red and blue losses are for a network without batch_normalization. Then, using the same training and eval sets and the same architecture (plus batch_normalization), you can see that the pink training loss drops similarly to how it did without BN, but the evaluation loss is way off.

Also, I'm graphing a histogram of the predicted values during training. They are always in the range (100, 300). Using the same training dataset to predict (estimator.predict()), the output is now in the range (15, 19). Way off. It looks like the moving_mean and moving_variance are not being saved correctly during training (or not used correctly during EVAL and PREDICT).

Thanks.

CasiaFan (Author) commented on Mar 16, 2018

@formigone I think you should put the tf.control_dependencies() block inside your estimator's TRAIN branch, since the moving average and variance only need to be updated during training. Here is my example:

def model_fn(features, labels, mode, params):
    if mode == tf.estimator.ModeKeys.TRAIN:
        is_training = True
    else:
        is_training = False
    input_tensor = features["input"]
    logits = your_model_fn(input_tensor, is_training=is_training)
    probs = tf.nn.softmax(logits, name="output_score")
    predictions = tf.argmax(probs, axis=-1, name="output_label")
    # provide a tf.estimator spec for PREDICT
    predictions_dict = {"score": probs,
                        "label": predictions}
    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions_output = tf.estimator.export.PredictOutput(predictions_dict)
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions_dict,
                                          export_outputs={
                                              tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: predictions_output
                                          })
    # calculate loss
    onehot_labels = tf.one_hot(tf.cast(labels, tf.int32), _NUM_CLASSES)
    gamma = 1.5
    weights = tf.reduce_sum(tf.multiply(onehot_labels, tf.pow(1. - probs, gamma)), axis=-1)
    loss = tf.losses.softmax_cross_entropy(onehot_labels, logits, weights=weights)
    accuracy = tf.metrics.accuracy(labels=labels,
                                   predictions=predictions)
    if mode == tf.estimator.ModeKeys.TRAIN:
        lr = 0.001
        optimizer = tf.train.RMSPropOptimizer(learning_rate=lr, decay=0.9)
        update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        with tf.control_dependencies(update_ops):
            train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        tensors_to_log = {'batch_accuracy': accuracy[1],
                          'logits': logits,
                          'label': labels}
        logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=1000)
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train_op,
                                          training_hooks=[logging_hook])
    else:
        eval_metric_ops = {"accuracy": accuracy}
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          eval_metric_ops=eval_metric_ops)

formigone (Contributor) commented on Jul 27, 2018

As an update to my issue: what I was doing wrong in the sample I posted is that I was reading the update ops collection

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

before I had added any batch_normalization ops to my graph. The documentation has since been updated, so this requirement is now described more clearly in the batch norm documentation.
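
In other words (a minimal sketch of the ordering mistake; x is a hypothetical tensor):

# Wrong: the collection is read before any BN layer exists, so it is empty
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)  # returns []
x = tf.layers.batch_normalization(x, training=True)

# Right: build all BN layers first, then read the collection
x = tf.layers.batch_normalization(x, training=True)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)  # now holds the moving mean/variance updates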

yell commented on Sep 5, 2018

@formigone it might not be the case, but you can try lowering the momentum in batch_normalization(..., momentum=0.99, ...) to e.g. 0.9. The default can often be too "smooth", and as a result the validation metrics can be worse, since the moving statistics forget past values more slowly.
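
For example (a sketch; x and is_training stand in for your own tensors):

x = tf.layers.batch_normalization(x, training=is_training, momentum=0.9)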

kielnino commented on Nov 7, 2018

If you are using a model inherited from tf.keras.Model then

extra_update_ops = model.get_updates_for(features)

will do the trick.

awerdich commented on Nov 23, 2018

Could anyone give me an idea of how these tf.keras.layers updates can be accomplished via the EstimatorSpec in the estimator's model_fn? Without running these update ops, the tf.keras.layers.BatchNormalization layer will not work properly with an estimator.

kielnino commented on Nov 23, 2018

@awerdich That would be

def model_fn(features, labels, mode):

    model = KerasModel()

    training = (mode == tf.estimator.ModeKeys.TRAIN)
    y = model(features, training=training)

    predictions = {
        # Generate predictions (for PREDICT mode)
        "prob": y,
    }
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Calculate Loss (for both TRAIN and EVAL modes)

    loss = tf.reduce_mean(tf.keras.backend.binary_crossentropy(labels['y_true'], y))
    # Configure the Training Op (for TRAIN mode)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.MomentumOptimizer(learning_rate=1e-4, momentum=0.9)

        with tf.control_dependencies(model.get_updates_for(features)):
            train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())

        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)

    # Add evaluation metrics (for EVAL mode)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss)

mmkamani7 commented on Aug 13, 2019

This should be marked as the solution. The problem with Keras layers in the tf.estimator API is that their update ops are not attached to tf.GraphKeys.UPDATE_OPS, so they have to be added using this line:

If you are using a model inherited from tf.keras.Model then

extra_update_ops = model.get_updates_for(features)

will do the trick.

isaacgerg commented on Sep 14, 2019

Why is this marked as closed? I think this should be fixed so that tf.keras.models.Model() captures the batchnorm variables so that when model.fit() is called, these parameters are updated.
