tf.nn.separable_conv2d is slower than conv2d on GPU #12940

Closed · ghost opened this issue Sep 9, 2017 · 19 comments

ghost commented Sep 9, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): TF 1.3
  • Python version: 3.6
  • Bazel version (if compiling from source):
  • CUDA/cuDNN version: CUDA 8.0 / cuDNN 6
  • GPU model and memory: GTX 1080 Ti, 11 GB

Describe the problem

In theory, separable_conv2d should be more efficient than conv2d, but when I test a simple model on CIFAR-10, the results show that nn.separable_conv2d runs slower on GPU, although it is indeed faster on CPU.

Here are my test results on GPU:

training time for normal_conv after 2000 step: 8.18395892999979 sec
time for normal_conv after one forward step:  0.003980965999289765 sec
training time for separable_conv after 2000 step: 9.158266903999902 sec
time for separable_conv after one forward step:  0.0036441169995669043 sec

Source code / logs

Below is a fully self-contained example. I first define a model with two conv2d layers, then I define another model with one conv2d followed by one separable_conv2d. Both models use the same number of channels for each conv layer (16 in the code below) and identical fc layers.

import tensorflow as tf
import timeit
import numpy as np
from tensorflow.contrib.keras.python.keras.datasets.cifar10 import load_data

(x_train, y_train), (x_val, y_val) = load_data()
learning_rate = 0.001
num_steps = 1000
n_classes = 10
batch_size = 32

def reformat(labels):
    # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(n_classes) == labels[:,None]).astype(np.float32)
    return  labels.reshape(labels.shape[0],10)
train_labels = reformat(y_train)
tf.reset_default_graph()
x = tf.placeholder(tf.float32, [None, 32, 32, 3])
y = tf.placeholder(tf.float32, [None, 10])
weights1 = {}
weights2 = {}
dtype = tf.float32
with tf.name_scope('INIT_OP'):
    conv_initializer =  tf.contrib.layers.xavier_initializer_conv2d(dtype=dtype)
    fc_initializer =  tf.contrib.layers.xavier_initializer(dtype=dtype)

k = 3
kernel = 16

# Define weights for normal ConvNet
with tf.name_scope('VARIABLES_1'):
    weights1['conv1'] = tf.get_variable('conv1', [k, k, 3, kernel], initializer=conv_initializer, dtype=dtype, trainable=True)
    weights1['b1'] = tf.get_variable('b1', initializer=tf.zeros([kernel]))
    weights1['conv2'] = tf.get_variable('conv2', [k, k, kernel, kernel], initializer=conv_initializer, dtype=dtype, trainable=True)
    weights1['b2'] = tf.get_variable('b2', initializer=tf.zeros([kernel]))

    weights1['wd1'] = tf.get_variable('wd1', [8*8*kernel, 512], initializer=fc_initializer, dtype=dtype, trainable=True)
    weights1['bd1'] = tf.get_variable('bd1',  initializer=tf.zeros([512]) )
    weights1['wd2'] = tf.get_variable('wd2', [512, 10], initializer=fc_initializer, dtype=dtype, trainable=True)
    weights1['bd2'] = tf.get_variable('bd2',  initializer=tf.zeros([10]) )


#Define weights for separable ConvNet
with tf.name_scope('VARIABLES_sep'):
    weights2['conv1'] = tf.get_variable('2_conv1', [k, k, 3, kernel], initializer=conv_initializer, dtype=dtype, trainable=True)
    weights2['conv_dw2'] = tf.get_variable('conv_dw2', [k, k, kernel, 1], initializer=conv_initializer, dtype=dtype, trainable=True)
    weights2['conv_pw2'] = tf.get_variable('conv_pw2', [1, 1, kernel, kernel], initializer=conv_initializer, dtype=dtype, trainable=True)

    weights2['b1'] = tf.get_variable('2_b1', initializer=tf.zeros([kernel]))
    weights2['b2'] = tf.get_variable('2_b2', initializer=tf.zeros([kernel]))

    weights2['wd1'] = tf.get_variable('2_wd1', [8*8*kernel, 512], initializer=fc_initializer, dtype=dtype, trainable=True)
    weights2['bd1'] = tf.get_variable('2_bd1',  initializer=tf.zeros([512]) )
    weights2['wd2'] = tf.get_variable('2_wd2', [512, 10], initializer=fc_initializer, dtype=dtype, trainable=True)
    weights2['bd2'] = tf.get_variable('2_bd2',  initializer=tf.zeros([10]) )

def forward_conv_sep( inp, weights):
    # Uses the passed-in weights dict (called with weights2) instead of the global.
    hidden = conv_block(inp, weights['conv1'], weights['b1'])
    hidden = maxpool2d(hidden)
    hidden = conv_block_dw(hidden, weights['conv_dw2'], weights['conv_pw2'], weights['b2'])
    hidden = maxpool2d(hidden)
    hidden = tf.reshape( hidden, [-1, np.prod([int(dim) for dim in hidden.get_shape()[1:]])] )
    fc1 = tf.matmul(hidden, weights['wd1']) + weights['bd1']
    fc1 = tf.nn.relu(fc1)
    return tf.matmul(fc1, weights['wd2']) + weights['bd2']

def forward_conv( inp, weights):
    # Uses the passed-in weights dict (called with weights1) instead of the global.
    hidden = conv_block(inp, weights['conv1'], weights['b1'])
    hidden = maxpool2d(hidden)
    hidden = conv_block(hidden, weights['conv2'], weights['b2'])
    hidden = maxpool2d(hidden)
    hidden = tf.reshape( hidden, [-1, np.prod([int(dim) for dim in hidden.get_shape()[1:]])] )
    fc1 = tf.matmul(hidden, weights['wd1']) + weights['bd1']
    fc1 = tf.nn.relu(fc1)
    return tf.matmul(fc1, weights['wd2']) + weights['bd2']


def conv_block_dw(inp, cweight_w, cweight_p, bweight):
    no_stride =  [1,1,1,1]
    conv_output = tf.nn.separable_conv2d(inp, cweight_w, cweight_p, no_stride, 'SAME') + bweight
    return tf.nn.relu(conv_output)

def conv_block(inp, cweight, bweight, activation=tf.nn.relu):
    no_stride =  [1,1,1,1]
    conv_output = tf.nn.conv2d(inp, cweight, no_stride, 'SAME') + bweight
    return activation(conv_output)

def maxpool2d(inp, k=2):
    return tf.nn.max_pool(inp, ksize=[1, k, k, 1], strides=[1, k, k, 1],
                          padding='SAME')

#logits for normal ConvNet
with tf.name_scope("forward_conv"):
    pred1 = forward_conv(x, weights1)

#Cost for normal ConvNet
with tf.name_scope("cost1"):
    loss1 = tf.nn.softmax_cross_entropy_with_logits(logits=pred1, labels=y)
    cost1 = tf.reduce_mean(loss1)

#training op for normal ConvNet
with tf.name_scope('train_op1'):
    train_op1 = tf.train.RMSPropOptimizer(learning_rate, 0.9).minimize(cost1)    

#logits for separable ConvNet
with tf.name_scope("forward_conv_sep"):
    pred2 = forward_conv_sep(x, weights2)

#Cost for separable ConvNet
with tf.name_scope("cost2"):
    loss2 = tf.nn.softmax_cross_entropy_with_logits(logits=pred2, labels=y)
    cost2 = tf.reduce_mean(loss2)

# training op for separable ConvNet
with tf.name_scope('train_op2'):
    train_op2 = tf.train.RMSPropOptimizer(learning_rate, 0.9).minimize(cost2)


with tf.name_scope('INIT'):
    init = tf.global_variables_initializer()


with tf.Session() as sess:

    sess.run(init)

    # train the normal ConvNet for num_steps steps
    start = timeit.default_timer()
    for step in range(num_steps):
        r = np.random.choice(y_train.shape[0], batch_size, replace=False)
        batch_data = x_train[r]
        batch_labels = train_labels[r]

        feed_dict = {x : batch_data, y: batch_labels}
        _ , l = sess.run([train_op1,cost1], feed_dict=feed_dict)

    stop = timeit.default_timer()
    print ('training time for normal_conv after '+str(num_steps)+' step:',stop - start) 


    start = timeit.default_timer()
    feed_dict = {x : batch_data, y: batch_labels}
    predictions1 = sess.run(pred1, feed_dict=feed_dict)
    stop = timeit.default_timer()
    print ('time for normal_conv after one forward step: ',stop - start)



    # train the separable ConvNet for num_steps steps
    start = timeit.default_timer()
    for step in range(num_steps):
        r = np.random.choice(y_train.shape[0], batch_size, replace=False)
        batch_data = x_train[r]
        batch_labels = train_labels[r]

        feed_dict = {x : batch_data, y: batch_labels}
        _ , l = sess.run([train_op2,cost2], feed_dict=feed_dict)


    stop = timeit.default_timer()
    print ('training time for sep_conv after '+str(num_steps)+' step:',stop - start) 

    start = timeit.default_timer()
    feed_dict = {x : batch_data, y: batch_labels}
    predictions = sess.run(pred2, feed_dict=feed_dict)
    stop = timeit.default_timer()
    print ('time for sep_conv after one forward step: ',stop - start)
cy89 added the stat:awaiting tensorflower and type:bug labels Sep 14, 2017

cy89 commented Sep 14, 2017

@vrv my recollection is that there are some issues with backprop for separable convolutions on GPUs, which have been somewhat improved lately. Can you please comment on the state of the art?

vrv commented Sep 14, 2017

I'm not aware of any specific issues; kernels get faster as important models need them! In general, convolutions of different sizes have different performance characteristics, and it's possible our separable convolution implementations are slow for some combinations of shapes. I'm not sure whether that's the case here, but it could be. I also don't know whether theory matches practice here, since separable convolutions are less compute-dense than normal convolutions. I believe the benefit is that you get to use fewer parameters to express a larger-capacity convolution.
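
As a back-of-envelope illustration of that parameter trade-off (a sketch using the 3x3 kernel and 16 channels from the model in this issue; the numbers say nothing about runtime):

# Parameter count: standard conv vs. depthwise separable conv (illustrative only).
k, c_in, c_out = 3, 16, 16

standard = k * k * c_in * c_out      # 3*3*16*16 = 2304
depthwise = k * k * c_in * 1         # 3*3*16    = 144 (one filter per input channel)
pointwise = 1 * 1 * c_in * c_out     # 1*1*16*16 = 256
separable = depthwise + pointwise    # 400

print(standard, separable, separable / standard)  # 2304 400 ~0.17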

carlthome (Contributor) commented Sep 14, 2017

At what kernel sizes will convolutions be computed via FFT instead of directly? Anyway, a speedup by doing a separable convolution is more noticeable for larger kernels, so for small kernels the overhead involved in doing two convolutions might be larger than the speedup, especially for what I assume is a highly-optimized convolution setting with the 3x3 kernel (Winograd).

Essentially, for an [m, n] kernel it would take m*n calculations for a convolution and m+n calculations for the separable convolution, if I'm not mistaken.
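
For tf.nn.separable_conv2d specifically, which is depthwise separable rather than spatially separable, a rough per-output-position multiply-accumulate count looks like the sketch below (illustrative numbers only):

# MACs per output position: standard conv vs. depthwise separable conv.
def macs_standard(k, c_in, c_out):
    return k * k * c_in * c_out

def macs_separable(k, c_in, c_out):
    return k * k * c_in + c_in * c_out  # depthwise pass + 1x1 pointwise pass

for c in (16, 32, 128):
    print(c, round(macs_separable(3, c, c) / macs_standard(3, c, c), 3))
# 16  0.174
# 32  0.142
# 128 0.119
# Even at 16 filters the separable op does far fewer MACs in theory, so any
# slowdown likely comes from per-op overhead and kernel efficiency rather
# than raw arithmetic.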

ghost (Author) commented Oct 5, 2017

@carlthome @vrv I do notice that when the number of filters gets larger (128 or 192, etc.), separable_conv2d is faster than conv2d. But in my case, where I applied separable_conv2d on CIFAR-10 with a small number of filters, it is actually slower than conv2d on my GPU. What could be the cause?

carlthome (Contributor) commented

Same principle. For small kernel matrices, the overhead involved in doing two convolutions might be larger than the speedup.
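
One way to see that overhead directly is a minimal micro-benchmark of the two ops in isolation (a sketch using the TF 1.x API from this issue; the shapes, step count, and variable names are arbitrary choices for illustration):

import timeit
import numpy as np
import tensorflow as tf

def time_ops(channels, steps=200):
    tf.reset_default_graph()
    x = tf.placeholder(tf.float32, [32, 32, 32, channels])
    w = tf.get_variable('w', [3, 3, channels, channels])
    dw = tf.get_variable('dw', [3, 3, channels, 1])
    pw = tf.get_variable('pw', [1, 1, channels, channels])
    conv = tf.nn.conv2d(x, w, [1, 1, 1, 1], 'SAME')
    sep = tf.nn.separable_conv2d(x, dw, pw, [1, 1, 1, 1], 'SAME')
    data = np.random.rand(32, 32, 32, channels).astype(np.float32)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for name, op in [('conv2d', conv), ('separable_conv2d', sep)]:
            sess.run(op, {x: data})  # warm-up run, excluded from the timing
            start = timeit.default_timer()
            for _ in range(steps):
                sess.run(op, {x: data})
            print(channels, name, timeit.default_timer() - start)

for c in (16, 64, 128):
    time_ops(c)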

ghost (Author) commented Oct 5, 2017

Understood, thanks for the explanation!

tensorflowbutler (Member) commented

It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

tensorflowbutler removed the stat:awaiting tensorflower label Jan 23, 2018
tensorflowbutler (Member) commented

A member of the TensorFlow organization has replied after the stat:awaiting tensorflower label was applied.

HouseOfFinwe commented

Why is this closed? I implemented UNet with separable_conv2d and it was around 80% slower than using a standard conv2d. Is there anything in the works to optimize this, like using groups?

Niels-Sch commented

I am getting an immense slowdown as well; is there a more efficient implementation known?

keunwoochoi commented

Same problem. @HouseOfFinwe Did you figure out how to fix this?

Niels-Sch commented

@keunwoochoi I did not; please let me know if either of you managed.

keunwoochoi commented

@HouseOfFinwe OK, thanks. tensorlayer/TensorLayer#416 (comment) says the speed issue was fixed in TF 1.5, but I'm still seeing it with 1.13 now. Do you remember which versions you tried?

Niels-Sch commented

Sadly, I do not.

Uchiha-Shisui commented

Still exists on 1.14.

liuheng92 commented

Why was this closed? This still happens with TF 1.13.

byronyi (Contributor) commented Feb 10, 2020

This should be fixed by #33836. It currently requires tf-nightly but will ship in the coming 2.2 release.
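
For anyone re-checking on a newer release, a minimal sketch using the Keras layers (timings depend heavily on GPU, shapes, and TF build, so this only shows how to compare the two ops locally):

import time
import tensorflow as tf

x = tf.random.normal([32, 32, 32, 128])  # NHWC batch of random inputs
conv = tf.keras.layers.Conv2D(128, 3, padding='same')
sep = tf.keras.layers.SeparableConv2D(128, 3, padding='same')

for name, layer in [('Conv2D', conv), ('SeparableConv2D', sep)]:
    y = layer(x)             # builds weights and warms up the kernel
    start = time.time()
    for _ in range(200):
        y = layer(x)
    _ = y.numpy()            # force pending GPU work to finish before reading the clock
    print(name, time.time() - start)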

This issue was closed.